Scalding - the not-so-basics @ ScalaDays 2014

Some more in-depth tips on writing and optimising Scalding MapReduce jobs

  1. 1. Scalding the not-so-basics Konrad 'ktoso' Malawski Scala Days 2014 @ Berlin
  2. 2. Konrad `@ktosopl` Malawski typesafe.com geecon.org Java.pl / KrakowScala.pl sckrk.com / meetup.com/Paper-Cup @ London GDGKrakow.pl meetup.com/Lambda-Lounge-Krakow hAkker @
  3. 3. http://hadoop.apache.org/ http://research.google.com/archive/mapreduce.html How old is this guy?
  4. 4. http://hadoop.apache.org/ http://research.google.com/archive/mapreduce.html Google MapReduce, paper: 2004 Hadoop (Yahoo impl): 2005
  5. 5. the Big Landscape
  6. 6. Hadoop
  7. 7. https://github.com/twitter/scalding Scalding is “on top of” Hadoop
  8. 8. https://github.com/twitter/scalding Scalding is “on top of” Cascading, which is “on top of” Hadoop http://www.cascading.org/
  9. 9. https://github.com/twitter/scalding Summingbird is “on top of” Scalding, which is “on top of” Cascading, which is “on top of” Hadoop http://www.cascading.org/ https://github.com/twitter/summingbird
  10. 10. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/
  11. 11. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/
  12. 12. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ HDFS yes, MapReduce no
  13. 13. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/
  14. 14. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ HDFS yes, MapReduce no
  15. 15. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark is a bit “separate” currently. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ HDFS yes, MapReduce no Possibly soon?!
  16. 16. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop; Spark has nothing to do with all this. http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ -streams
  17. 17. https://github.com/twitter/scalding Summingbird is “on top of” Scalding or Storm, which is “on top of” Cascading, which is “on top of” Hadoop http://www.cascading.org/ https://github.com/twitter/summingbird http://storm.incubator.apache.org/ http://spark.apache.org/ this talk
  18. 18. Why?
  19. 19. Stuff > Memory Scala collections... fun, but memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))!
  20. 20. Stuff > Memory Scala collections... fun, but memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory
  21. 21. Stuff > Memory Scala collections... fun, but memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory in Memory
  22. 22. Stuff > Memory Scala collections... fun, but memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory in Memory in Memory
  23. 23. Stuff > Memory Scala collections... fun, but memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory in Memory in Memory in Memory
  24. 24. Stuff > Memory Scala collections... fun, but memory bound! val text = "so many words... waaah! ..."! ! ! text! .split(" ")! .map(a => (a, 1))! .groupBy(_._1)! .map(a => (a._1, a._2.map(_._2).sum))! in Memory in Memory in Memory in Memory in Memory
  25. 25. package org.myorg;! ! import org.apache.hadoop.fs.Path;! import org.apache.hadoop.io.IntWritable;! import org.apache.hadoop.io.LongWritable;! import org.apache.hadoop.io.Text;! import org.apache.hadoop.mapred.*;! ! import java.io.IOException;! import java.util.Iterator;! import java.util.StringTokenizer;! ! public class WordCount {! ! public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {! private final static IntWritable one = new IntWritable(1);! private Text word = new Text();! ! public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {! String line = value.toString();! StringTokenizer tokenizer = new StringTokenizer(line);! while (tokenizer.hasMoreTokens()) {! word.set(tokenizer.nextToken());! output.collect(word, one);! Why Scalding? Word Count in Hadoop MR
  26. 26. private final static IntWritable one = new IntWritable(1);! private Text word = new Text();! ! public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {! String line = value.toString();! StringTokenizer tokenizer = new StringTokenizer(line);! while (tokenizer.hasMoreTokens()) {! word.set(tokenizer.nextToken());! output.collect(word, one);! }! }! }! ! public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {! public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {! int sum = 0;! while (values.hasNext()) {! sum += values.next().get();! }! output.collect(key, new IntWritable(sum));! }! }! ! public static void main(String[] args) throws Exception {! JobConf conf = new JobConf(WordCount.class);! conf.setJobName("wordcount");! ! conf.setOutputKeyClass(Text.class);! conf.setOutputValueClass(IntWritable.class);! ! conf.setMapperClass(Map.class);! conf.setCombinerClass(Reduce.class);! conf.setReducerClass(Reduce.class);! ! conf.setInputFormat(TextInputFormat.class);! conf.setOutputFormat(TextOutputFormat.class);! ! FileInputFormat.setInputPaths(conf, new Path(args[0]));! FileOutputFormat.setOutputPath(conf, new Path(args[1]));! ! JobClient.runJob(conf);! }! }! Why Scalding? Word Count in Hadoop MR
  27. 27. “Field API”
  28. 28. map
  29. 29. val data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map Scala:
  30. 30. val data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map Scala:
  31. 31. val data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map IterableSource(data) .map('number -> 'doubled) { n: Int => n * 2 } Scala:
  32. 32. val data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map IterableSource(data) .map('number -> 'doubled) { n: Int => n * 2 } Scala: available in Pipe
  33. 33. val data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } map IterableSource(data) .map('number -> 'doubled) { n: Int => n * 2 } Scala: available in Pipe, stays in Pipe
  34. 34. val data = 1 :: 2 :: 3 :: Nil! ! val doubled = data map { _ * 2 }! ! // Int => Int map IterableSource(data)! .map('number -> 'doubled) { n: Int => n * 2 }! ! ! // Int => Int Scala: must choose type!
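To make the Fields-API snippets above concrete, here is a minimal runnable sketch (class name and output path are hypothetical) of the map example inside a complete Job. Both 'number and 'doubled remain in the pipe, as the callouts note:

    import com.twitter.scalding._

    // Hypothetical job wrapping the map example above.
    class DoubleNumbersJob(args: Args) extends Job(args) {
      IterableSource(List(1, 2, 3), 'number)
        .map('number -> 'doubled) { n: Int => n * 2 }  // 'number stays, 'doubled is added
        .write(Tsv(args("output")))                    // rows: number <tab> doubled
    }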
  35. 35. mapTo
  36. 36. var data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo Scala:
  37. 37. var data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo Scala: “release reference”
  38. 38. var data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo Scala: “release reference”
  39. 39. var data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo IterableSource(data) .mapTo('doubled) { n: Int => n * 2 } Scala: “release reference”
  40. 40. var data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo IterableSource(data) .mapTo('doubled) { n: Int => n * 2 } Scala: doubled stays in Pipe “release reference”
  41. 41. var data = 1 :: 2 :: 3 :: Nil val doubled = data map { _ * 2 } data = null mapTo IterableSource(data) .mapTo('doubled) { n: Int => n * 2 } Scala: doubled stays in Pipe, number is removed “release reference”
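For contrast, the same hypothetical job using mapTo, which drops the input field so only 'doubled reaches the sink (roughly map followed by project('doubled), but cheaper):

    import com.twitter.scalding._

    // Hypothetical job wrapping the mapTo example above.
    class DoubleNumbersOnlyJob(args: Args) extends Job(args) {
      IterableSource(List(1, 2, 3), 'number)
        .mapTo('number -> 'doubled) { n: Int => n * 2 }  // 'number is discarded here
        .write(Tsv(args("output")))                      // rows: doubled
    }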
  42. 42. flatMap
  43. 43. val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",") // Array[String] } map { _.toInt } // List[Int] flatMap Scala:
  44. 44. val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",") // Array[String] } map { _.toInt } // List[Int] flatMap Scala:
  45. 45. val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",") // Array[String] } map { _.toInt } // List[Int] flatMap TextLine(data) // like List[String] .flatMap('line -> 'word) { _.split(",") } // like List[String] .map('word -> 'number) { _.toInt } // like List[Int] Scala:
  46. 46. flatMap
  47. 47. val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",").map(_.toInt) // Array[Int] } flatMap Scala:
  48. 48. val data = "1" :: "2,2" :: "3,3,3" :: Nil // List[String] val numbers = data flatMap { line => // String line.split(",").map(_.toInt) // Array[Int] } flatMap TextLine(data) // like List[String] .flatMap('line -> 'word) { _.split(",").map(_.toInt) } // like List[Int] Scala:
  49. 49. groupBy
  50. 50. val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy Scala:
  51. 51. val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy Scala:
  52. 52. val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.size } Scala:
  53. 53. val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.size } Scala: groups all with == value
  54. 54. val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int] val groups = data groupBy { _ < 10 } groups // Map[Boolean, List[Int]] groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.size } Scala: groups all with == value counts per 'lessThanTen
  55. 55. groupBy
  56. 56. groupBy IterableSource(List(1, 2, 30, 42), 'num)
  57. 57. groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 }
  58. 58. groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.sum[Int]('num -> 'total) }
  59. 59. groupBy IterableSource(List(1, 2, 30, 42), 'num) .map('num -> 'lessThanTen) { i: Int => i < 10 } .groupBy('lessThanTen) { _.sum[Int]('num -> 'total) } 'total = [3, 72]
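Again as a sketch (job class and output path are hypothetical), the grouped sum wired into a complete Job; with this input the groups come out as true -> 3 (1 + 2) and false -> 72 (30 + 42):

    import com.twitter.scalding._

    // Hypothetical job wrapping the groupBy example above.
    class LessThanTenJob(args: Args) extends Job(args) {
      IterableSource(List(1, 2, 30, 42), 'num)
        .map('num -> 'lessThanTen) { i: Int => i < 10 }
        .groupBy('lessThanTen) { _.sum[Int]('num -> 'total) }  // reduce-side sum per group
        .write(Tsv(args("output")))  // lessThanTen <tab> total
    }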
  60. 60. import org.apache.hadoop.conf.Configuration! import org.apache.hadoop.util.ToolRunner! import com.twitter.scalding! ! object ScaldingJobRunner extends App {! ! ToolRunner.run(new Configuration, new scalding.Tool, args)! ! } Main Class - "Runner"
  61. 61. import org.apache.hadoop.conf.Configuration! import org.apache.hadoop.util.ToolRunner! import com.twitter.scalding! ! object ScaldingJobRunner extends App {! ! ToolRunner.run(new Configuration, new scalding.Tool, args)! ! } Main Class - "Runner" from App
  62. 62. class WordCountJob(args: Args) extends Job(args) {! ! ! ! ! ! ! ! ! ! ! } Word Count in Scalding
  63. 63. class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! ! ! ! ! ! ! } Word Count in Scalding
  64. 64. class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! ! ! ! ! ! } Word Count in Scalding
  65. 65. class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! ! ! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  66. 66. class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { group => group.size('count) }! ! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  67. 67. class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { group => group.size }! ! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  68. 68. class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! ! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  69. 69. class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! .write(Tsv(outputFile))! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding
  70. 70. class WordCountJob(args: Args) extends Job(args) {! ! val inputFile = args("input")! val outputFile = args("output")! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! .write(Tsv(outputFile))! ! def tokenize(text: String): Array[String] = implemented! } Word Count in Scalding (the core is just 4 lines!)
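The deck leaves the body of tokenize elided ("implemented"); a common implementation (an assumption here, not necessarily the author's) lower-cases, strips punctuation, and splits on whitespace:

    // Assumed tokenizer, in the spirit of the canonical Scalding word count:
    def tokenize(text: String): Array[String] =
      text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")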
  71. 71. 1 day in the life of a guy implementing Scalding jobs
  72. 72. “How much are my shops selling?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .write(Tsv(output))
  73. 73. “How much are my shops selling?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .write(Tsv(output)) 1!107! 2!144! 3!16! … …
  74. 74. “How much are my shops selling?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .write(Tsv(output, writeHeader = true))
  75. 75. “How much are my shops selling?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .write(Tsv(output, writeHeader = true)) shopId! totalSoldItems! 1!! ! ! 107! 2!! ! ! 144! 3!! ! ! 16! …!! ! ! …
  76. 76. “Which are the top selling shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse }! .write(Tsv(output, writeHeader = true))
  77. 77. “Which are the top selling shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse }! .write(Tsv(output, writeHeader = true)) shopId! totalSoldItems! 2!! ! ! 144 ! 1!! ! ! 107! 3!! ! ! 16! …!! ! ! …
  78. 78. “What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }! .write(Tsv(output, writeHeader = true))
  79. 79. “What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }! .write(Tsv(output, writeHeader = true)) shopId! totalSoldItems! 2!! ! ! 144 ! 1!! ! ! 107! 3!! ! ! 16
  80. 80. “What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { _.sortBy('totalSoldItems).reverse.take(3) }! .write(Tsv(output, writeHeader = true)) shopId! totalSoldItems! 2!! ! ! 144 ! 1!! ! ! 107! 3!! ! ! 16 SLOW! Instead do sortWithTake!
  81. 81. “What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3) ! }! .write(Tsv(output, writeHeader = true))
  82. 82. “What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3) ! }! .write(Tsv(output, writeHeader = true)) x! List((5,146), (2,142), (3,32))!
  83. 83. “What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3) ! }! .write(Tsv(output, writeHeader = true)) x! List((5,146), (2,142), (3,32))! WAT!?
  84. 84. “What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortedReverseTake[Long]('totalSoldItems -> 'x, 3) ! }! .write(Tsv(output, writeHeader = true)) x! List((5,146), (2,142), (3,32))! WAT!? Emits scala.collection.List[_]
  85. 85. “What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) { ! (l: (Long, Long), r: (Long, Long)) => ! l._2 < r._2 ! }! }! .flatMapTo('x -> ('shopId, 'totalSold)) { ! x: List[(Long, Long)] => x! }! .write(Tsv(output, writeHeader = true))
  86. 86. “What’s the top 3 shops?” Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) { ! (l: (Long, Long), r: (Long, Long)) => ! l._2 < r._2 ! }! }! .flatMapTo('x -> ('shopId, 'totalSold)) { ! x: List[(Long, Long)] => x! }! .write(Tsv(output, writeHeader = true)) Provide the Ordering explicitly, because the implicit Ordering on Tuple2 is not enough here
  87. 87. Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) { ! (l: (Long, Long), r: (Long, Long)) => ! l._2 < r._2 ! }! }! .flatMapTo('x -> ('shopId, 'totalSold)) { ! x: List[(Long, Long)] => x! }! .write(Tsv(output, writeHeader = true)) “What’s the top 3 shops?”
  88. 88. Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) { ! (l: (Long, Long), r: (Long, Long)) => ! l._2 < r._2 ! }! }! .flatMapTo('x -> ('shopId, 'totalSold)) { ! x: List[(Long, Long)] => x! }! .write(Tsv(output, writeHeader = true)) “What’s the top 3 shops?” shopId! totalSoldItems! 2!! ! ! 144 ! 1!! ! ! 107! 3!! ! ! 16
  89. 89. Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) { ! (l: (Long, Long), r: (Long, Long)) => ! l._2 < r._2 ! }! }! .flatMapTo('x -> ('shopId, 'totalSold)) { ! x: List[(Long, Long)] => x! }! .write(Tsv(output, writeHeader = true)) “What’s the top 3 shops?”
  90. 90. Tsv(input, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.sum[Long]('quantity -> 'totalSoldItems)! }! .groupAll { ! _.sortWithTake(('shopId, 'totalSoldItems) -> 'x, 3) { ! (l: (Long, Long), r: (Long, Long)) => ! l._2 < r._2 ! }! }! .flatMapTo('x -> ('shopId, 'totalSold)) { ! x: List[(Long, Long)] => x! }! .write(Tsv(output, writeHeader = true)) “What’s the top 3 shops?” MUCH faster Job = Happier me.
  91. 91. Reduce, these Monoids
  92. 92. Reduce, these Monoids
  93. 93. trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } Reduce, these Monoids
  94. 94. Reduce, these Monoids trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } interface:
  95. 95. Reduce, these Monoids + 3 laws: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } interface:
  96. 96. Reduce, these Monoids + 3 laws: Closure: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } interface:
  97. 97. Reduce, these Monoids + 3 laws: (T, T) => TClosure: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T:a·b∈T interface:
  98. 98. Reduce, these Monoids + 3 laws: (T, T) => TClosure: Associativity: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T:a·b∈T interface:
  99. 99. Reduce, these Monoids + 3 laws: (T, T) => TClosure: Associativity: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T:a·b∈T ∀a,b,c∈T:(a·b)·c=a·(b·c) (a + b) + c! ==! a + (b + c) interface:
  100. 100. Reduce, these Monoids + 3 laws: (T, T) => TClosure: Associativity: Identity element: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T:a·b∈T ∀a,b,c∈T:(a·b)·c=a·(b·c) (a + b) + c! ==! a + (b + c) interface:
  101. 101. Reduce, these Monoids + 3 laws: (T, T) => TClosure: Associativity: Identity element: trait Monoid[T] {! def zero: T! def +(a: T, b: T): T! } ∀a,b∈T:a·b∈T ∀a,b,c∈T:(a·b)·c=a·(b·c) (a + b) + c! ==! a + (b + c) interface: ∃z∈T:∀a∈T:z·a=a·z=a z + a == a + z == a
  102. 102. Reduce, these Monoids object IntSum extends Monoid[Int] {! def zero = 0! def +(a: Int, b: Int) = a + b! } Summing:
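A plain-Scala sketch of why these laws matter for MapReduce (plus stands in for the slide's +, and the values are made up): associativity means any grouping of the additions gives the same result, so each mapper can pre-combine its local values before the shuffle.

    trait Monoid[T] {
      def zero: T
      def plus(a: T, b: T): T
    }

    object IntSum extends Monoid[Int] {
      def zero = 0
      def plus(a: Int, b: Int) = a + b
    }

    // Each "mapper" pre-aggregates its local values before the shuffle...
    val partial1 = List(1, 2, 3).foldLeft(IntSum.zero)(IntSum.plus)  // 6
    val partial2 = List(4, 5).foldLeft(IntSum.zero)(IntSum.plus)     // 9
    // ...and the "reducer" merges only the partials; associativity
    // guarantees this equals folding all values in one place.
    val total = IntSum.plus(partial1, partial2)                      // 15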
  103. 103. Monoid ops can start “Map-side” bear, 2 car, 3 deer, 2 river, 2 Monoid ops can already start being computed map-side!
  104. 104. Monoid ops can start “Map-side” average() sum() sortWithTake() histogram() Examples: bear, 2 car, 3 deer, 2 river, 2
  105. 105. Obligatory: “Go check out Algebird, NOW!” slide https://github.com/twitter/algebird ALGE-birds
  106. 106. BloomFilterMonoid https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL val NUM_HASHES = 6! val WIDTH = 32! val SEED = 1! val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)! ! val bf1 = bfMonoid.create("1", "2", "3", "4", "100")! val bf2 = bfMonoid.create("12", "45")! val bf = bf1 ++ bf2! // bf: com.twitter.algebird.BF =! ! val approxBool = bf.contains("1")! // approxBool: com.twitter.algebird.ApproximateBoolean = ApproximateBoolean(true,0.9290349745708529)! ! val res = approxBool.isTrue! // res: Boolean = true
  107. 107. BloomFilterMonoid https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL val NUM_HASHES = 6! val WIDTH = 32! val SEED = 1! val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)! ! val bf1 = bfMonoid.create("1", "2", "3", "4", "100")! val bf2 = bfMonoid.create("12", "45")! val bf = bf1 ++ bf2! // bf: com.twitter.algebird.BF =! ! val approxBool = bf.contains("1")! // approxBool: com.twitter.algebird.ApproximateBoolean = ApproximateBoolean(true,0.9290349745708529)! ! val res = approxBool.isTrue! // res: Boolean = true
  108. 108. BloomFilterMonoid https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL val NUM_HASHES = 6! val WIDTH = 32! val SEED = 1! val bfMonoid = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)! ! val bf1 = bfMonoid.create("1", "2", "3", "4", "100")! val bf2 = bfMonoid.create("12", "45")! val bf = bf1 ++ bf2! // bf: com.twitter.algebird.BF =! ! val approxBool = bf.contains("1")! // approxBool: com.twitter.algebird.ApproximateBoolean = ApproximateBoolean(true,0.9290349745708529)! ! val res = approxBool.isTrue! // res: Boolean = true
  109. 109. BloomFilterMonoid Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemName: String) => bf + itemName ! }! }! .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true))
  110. 110. BloomFilterMonoid Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemName: String) => bf + itemName ! }! }! .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true)) shopId! hasSoldBeer!hasSoldWurst! 1!! ! ! false!! ! ! true! 2!! ! ! false!! ! ! true! 3!! ! ! false!! ! ! true! 4!! ! ! true! ! ! ! false! 5!! ! ! true! ! ! ! false!
  111. 111. BloomFilterMonoid Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemName: String) => bf + itemName ! }! }! .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true)) shopId! hasSoldBeer!hasSoldWurst! 1!! ! ! false!! ! ! true! 2!! ! ! false!! ! ! true! 3!! ! ! false!! ! ! true! 4!! ! ! true! ! ! ! false! 5!! ! ! true! ! ! ! false! Why not Set[String]? It would OutOfMemory.
  112. 112. BloomFilterMonoid Csv(input, separator, ('shopId, 'itemId, 'itemName, 'quantity))! .groupBy('shopId) {! _.foldLeft('itemName -> 'itemBloom)(bfMonoid.zero) { ! (bf: BF, itemName: String) => bf + itemName ! }! }! .map('itemBloom -> 'hasSoldBeer) { b: BF => b.contains("beer").isTrue }! .map('itemBloom -> 'hasSoldWurst) { b: BF => b.contains("wurst").isTrue }! .discard('itemBloom)! .write(Tsv(output, writeHeader = true)) shopId! hasSoldBeer!hasSoldWurst! 1!! ! ! false!! ! ! true! 2!! ! ! false!! ! ! true! 3!! ! ! false!! ! ! true! 4!! ! ! true! ! ! ! false! 5!! ! ! true! ! ! ! false! ApproximateBoolean(true,0.9999580954658956) Why not Set[String]? It would OutOfMemory.
  113. 113. Joins
  114. 114. Joins that.joinWithLarger('id1 -> 'id2, other)! that.joinWithSmaller('id1 -> 'id2, other)! ! ! that.joinWithTiny('id1 -> 'id2, other)
  115. 115. Joins that.joinWithLarger('id1 -> 'id2, other)! that.joinWithSmaller('id1 -> 'id2, other)! ! ! that.joinWithTiny('id1 -> 'id2, other) joinWithTiny is appropriate when you know that # of rows in bigger pipe > mappers * # rows in smaller pipe, where mappers is the number of mappers in the job.
  116. 116. Joins that.joinWithLarger('id1 -> 'id2, other)! that.joinWithSmaller('id1 -> 'id2, other)! ! ! that.joinWithTiny('id1 -> 'id2, other) joinWithTiny is appropriate when you know that # of rows in bigger pipe > mappers * # rows in smaller pipe, where mappers is the number of mappers in the job. The “usual”
  117. 117. Joins val people = IterableSource(! (1, "hans") ::! (2, "bob") ::! (3, "hermut") ::! (4, "heinz") ::! (5, "klemens") :: … :: Nil,! ('id, 'name)) val cars = IterableSource(! (99, 1, "bmw") :: ! (123, 2, "mercedes") ::! (240, 11, "other") :: Nil,! ('carId, 'ownerId, 'carName))!
  118. 118. Joins import com.twitter.scalding.FunctionImplicits._! ! people.joinWithLarger('id -> 'ownerId, cars)! .map(('name, 'carName) -> 'sentence) { ! (name: String, car: String) =>! s"Hello $name, your $car is really nice"! }! .project('sentence)! .write(output) val people = IterableSource(! (1, "hans") ::! (2, "bob") ::! (3, "hermut") ::! (4, "heinz") ::! (5, "klemens") :: … :: Nil,! ('id, 'name)) val cars = IterableSource(! (99, 1, "bmw") :: ! (123, 2, "mercedes") ::! (240, 11, "other") :: Nil,! ('carId, 'ownerId, 'carName))!
  119. 119. Joins import com.twitter.scalding.FunctionImplicits._! ! people.joinWithLarger('id -> 'ownerId, cars)! .map(('name, 'carName) -> 'sentence) { ! (name: String, car: String) =>! s"Hello $name, your $car is really nice"! }! .project('sentence)! .write(output) Hello hans, your bmw is really nice! Hello bob, your mercedes is really nice! val people = IterableSource(! (1, "hans") ::! (2, "bob") ::! (3, "hermut") ::! (4, "heinz") ::! (5, "klemens") :: … :: Nil,! ('id, 'name)) val cars = IterableSource(! (99, 1, "bmw") :: ! (123, 2, "mercedes") ::! (240, 11, "other") :: Nil,! ('carId, 'ownerId, 'carName))!
  120. 120. “map-side” join that.joinWithTiny('id1 -> 'id2, tinyPipe) Choose this when: Left > max(mappers,reducers) * Right! or: when the Left side is 3 orders of magnitude larger.
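A hedged sketch (data and names are hypothetical) of such a map-side join: the tiny countries pipe is small enough to replicate to every mapper, so the join avoids a reduce phase entirely.

    import com.twitter.scalding._

    class CountryJoinJob(args: Args) extends Job(args) {
      // Large side: one row per user.
      val users = IterableSource(
        List((1, 10, "hans"), (2, 20, "bob")), ('userId, 'countryId, 'name))
      // Tiny side: one row per country; replicated to all mappers.
      val countries = IterableSource(
        List((10, "Germany"), (20, "Poland")), ('cId, 'country))

      users
        .joinWithTiny('countryId -> 'cId, countries)
        .project('name, 'country)
        .write(Tsv(args("output")))
    }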
  121. 121. Skew Joins val sampleRate = 0.001! val reducers = 10! val replicationFactor = 1! val replicator = SkewReplicationA(replicationFactor)! ! ! val genders: RichPipe = …! val followers: RichPipe = …! ! followers! .skewJoinWithSmaller('y1 -> 'y2, genders, sampleRate, reducers, replicator)! .project('x1, 'y1, 's1, 'x2, 'y2, 's2)! .write(Tsv("output"))
  122. 122. Skew Joins val sampleRate = 0.001! val reducers = 10! val replicationFactor = 1! val replicator = SkewReplicationA(replicationFactor)! ! ! val genders: RichPipe = …! val followers: RichPipe = …! ! followers! .skewJoinWithSmaller('y1 -> 'y2, genders, sampleRate, reducers, replicator)! .project('x1, 'y1, 's1, 'x2, 'y2, 's2)! .write(Tsv("output")) 1. Sample from the left and right pipes with some small probability,
 in order to determine approximately how often each join key appears in each pipe.
  123. 123. Skew Joins val sampleRate = 0.001! val reducers = 10! val replicationFactor = 1! val replicator = SkewReplicationA(replicationFactor)! ! ! val genders: RichPipe = …! val followers: RichPipe = …! ! followers! .skewJoinWithSmaller('y1 -> 'y2, genders, sampleRate, reducers, replicator)! .project('x1, 'y1, 's1, 'x2, 'y2, 's2)! .write(Tsv("output")) 1. Sample from the left and right pipes with some small probability,
 in order to determine approximately how often each join key appears in each pipe. 2. Use these estimated counts to replicate the join keys, 
 according to the given replication strategy.
  124. 124. Skew Joins val sampleRate = 0.001! val reducers = 10! val replicationFactor = 1! val replicator = SkewReplicationA(replicationFactor)! ! ! val genders: RichPipe = …! val followers: RichPipe = …! ! followers! .skewJoinWithSmaller('y1 -> 'y2, genders, sampleRate, reducers, replicator)! .project('x1, 'y1, 's1, 'x2, 'y2, 's2)! .write(Tsv("output")) 1. Sample from the left and right pipes with some small probability,
 in order to determine approximately how often each join key appears in each pipe. 2. Use these estimated counts to replicate the join keys, 
 according to the given replication strategy. 3. Join the replicated pipes together.
  125. 125. Where did my type-safety go?!
  126. 126. Where did my type-safety go?! Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out))!
  127. 127. Where did my type-safety go?! Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out))! Caused by: cascading.flow.FlowException: local step failed at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:219) at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) Caused by: cascading.pipe.OperatorException: [com.twitter.scalding.C...][com.twitter.scalding.RichPipe.filter(RichPipe.scala:325)] operator Each failed executing operation at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:81) at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:34) at cascading.flow.stream.SourceStage.map(SourceStage.java:102) at cascading.flow.stream.SourceStage.call(SourceStage.java:53) at cascading.flow.stream.SourceStage.call(SourceStage.java:38) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.NumberFormatException: For input string: "bob" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:50) at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:29)
  128. 128. Where did my type-safety go?! Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out))! Caused by: cascading.flow.FlowException: local step failed at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:219) at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124) at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) Caused by: cascading.pipe.OperatorException: [com.twitter.scalding.C...][com.twitter.scalding.RichPipe.filter(RichPipe.scala:325)] operator Each failed executing operation at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:81) at cascading.flow.stream.FilterEachStage.receive(FilterEachStage.java:34) at cascading.flow.stream.SourceStage.map(SourceStage.java:102) at cascading.flow.stream.SourceStage.call(SourceStage.java:53) at cascading.flow.stream.SourceStage.call(SourceStage.java:38) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.NumberFormatException: For input string: "bob" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:50) at cascading.tuple.coerce.LongCoerce.coerce(LongCoerce.java:29) “oh, right… We changed that file to be user names, not ids…”
  129. 129. Trap it! Tsv(in, ('userId1, 'userId2, 'rel))! .addTrap(Tsv("errors")) // add a trap! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out))
  130. 130. Trap it! Tsv(in, ('userId1, 'userId2, 'rel))! .addTrap(Tsv("errors")) // add a trap! .filter('userId1) { uid1: Long => uid1 == 1337 }! .write(Tsv(out)) solves “dirty data”, no help for maintenance
  131. 131. Typed API
  132. 132. TypedAPI’s Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))!
  133. 133. TypedAPI’s Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))!
  134. 134. TypedAPI’s Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! Must give Type to each Field
  135. 135. TypedAPI’s Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 == "bob" }! .write(TypedTsv(out))!
  136. 136. TypedAPI’s Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 == "bob" }! .write(TypedTsv(out))! Tuple arity: 2 Tuple arity: 3
  137. 137. TypedAPI’s Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! Caused by: java.lang.IllegalArgumentException: num of types must equal number of fields: [{3}:'user1', 'user2', 'rel'], found: 2 at cascading.scheme.util.DelimitedParser.reset(DelimitedParser.java:176) TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 == "bob" }! .write(TypedTsv(out))! Tuple arity: 2 Tuple arity: 3
  138. 138. TypedAPI’s Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! Caused by: java.lang.IllegalArgumentException: num of types must equal number of fields: [{3}:'user1', 'user2', 'rel'], found: 2 at cascading.scheme.util.DelimitedParser.reset(DelimitedParser.java:176) TypedCsv[(String, String)](in, ('user1, 'user2, 'rel))! .filter { _._1 === "bob" }! .write(TypedTsv(out))! import TDsl._! ! TypedCsv[(String, String, Int)](in, ('user1, 'user2, 'rel))! .filter { _._1 == "bob" }! .write(TypedTsv(out))! Tuple arity: 2 Tuple arity: 3 “planing-time” exception
  139. 139. TypedAPI’s Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date)! .filter { _._1 == "bob" }! .write(TypedTsv(out))! ! }
  140. 140. TypedAPI’s Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date)! .filter { _._1 == "bob" }! .write(TypedTsv(out))! ! } Easier to reuse schemas now
  141. 141. TypedAPI’s Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date)! .filter { _._1 == "bob" }! .write(TypedTsv(out))! ! } Easier to reuse schemas now Not coupled by Field names, but still too magic for reuse… “_1”?
  142. 142. TypedAPI’s Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date) ! .filter { p: Person => p.name == "bob" }! .write(TypedTsv(out))! ! }
  143. 143. TypedAPI’s Tsv(in, ('userId1, 'userId2, 'rel))! .filter('userId1) { rel: Long => rel == 1337 }! .write(Tsv(out))! // … with Relationships {! import TDsl._! ! userRelationships(date) ! .filter { p: Person => p.name == "bob" }! .write(TypedTsv(out))! ! } TypedPipe[Person]
  144. 144. Typed Joins case class UserName(id: Long, handle: String)! case class UserFavs(byUser: Long, favs: List[Long])! case class UserTweets(byUser: Long, tweets: List[Long])! ! def users: TypedSource[UserName]! def favs: TypedSource[UserFavs]! def tweets: TypedSource[UserTweets]! ! def output: TypedSink[(UserName, UserFavs, UserTweets)]! ! users.groupBy(_.id)! .join(favs.groupBy(_.byUser))! .join(tweets.groupBy(_.byUser))! .map { case (uid, ((user, favs), tweets)) =>! (user, favs, tweets)! } ! .write(output)!
  145. 145. Typed Joins case class UserName(id: Long, handle: String)! case class UserFavs(byUser: Long, favs: List[Long])! case class UserTweets(byUser: Long, tweets: List[Long])! ! def users: TypedSource[UserName]! def favs: TypedSource[UserFavs]! def tweets: TypedSource[UserTweets]! ! def output: TypedSink[(UserName, UserFavs, UserTweets)]! ! users.groupBy(_.id)! .join(favs.groupBy(_.byUser))! .join(tweets.groupBy(_.byUser))! .map { case (uid, ((user, favs), tweets)) =>! (user, favs, tweets)! } ! .write(output)!
  146. 146. Typed Joins case class UserName(id: Long, handle: String)! case class UserFavs(byUser: Long, favs: List[Long])! case class UserTweets(byUser: Long, tweets: List[Long])! ! def users: TypedSource[UserName]! def favs: TypedSource[UserFavs]! def tweets: TypedSource[UserTweets]! ! def output: TypedSink[(UserName, UserFavs, UserTweets)]! ! users.groupBy(_.id)! .join(favs.groupBy(_.byUser))! .join(tweets.groupBy(_.byUser))! .map { case (uid, ((user, favs), tweets)) =>! (user, favs, tweets)! } ! .write(output)! 3-way-merge in 1 MR step
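The same idea shrunk to a runnable sketch with in-memory data (two-way instead of three-way; the case classes and values are hypothetical stand-ins for the slide's sources). Defining the case classes at the top level avoids capturing the Job in closures:

    import com.twitter.scalding._
    import com.twitter.scalding.typed.TDsl._

    case class UserName(id: Long, handle: String)
    case class UserFavs(byUser: Long, favs: List[Long])

    class TypedJoinJob(args: Args) extends Job(args) {
      val users = TypedPipe.from(List(UserName(1, "hans"), UserName(2, "bob")))
      val favs  = TypedPipe.from(List(UserFavs(1, List(10L, 11L))))

      users.groupBy(_.id)
        .join(favs.groupBy(_.byUser))                 // co-group on the user id
        .map { case (uid, (user, userFavs)) =>
          (user.handle, userFavs.favs.size)           // fully typed all the way
        }
        .write(TypedTsv[(String, Int)](args("output")))
    }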
  147. 147. > run pl.project13.oculus.job.WordCountJob ! --local --tool.graph --input in --output out! ! writing DOT: ! pl.project13.oculus.job.WordCountJob0.dot! ! writing Steps DOT: ! pl.project13.oculus.job.WordCountJob0_steps.dot Do the DOT
  148. 148. Do the DOT! ! ! ! pl.project13.oculus.job.WordCountJob0.dot! ! ! ! ! ! ! ! ! ! ! ! ! ! pl.project13.oculus.job.WordCountJob0_steps.dot
  149. 149. ! ! ! ! > dot -Tpng pl.project13.oculus.job.WordCountJob0.dot! ! ! ! ! ! ! ! ! ! ! ! ! ! Do the DOT
  150. 150. ! ! ! ! > dot -Tpng pl.project13.oculus.job.WordCountJob0.dot! ! ! ! ! ! ! ! ! ! ! ! ! ! Do the DOT M A P
  151. 151. ! ! ! ! > dot -Tpng pl.project13.oculus.job.WordCountJob0.dot! ! ! ! ! ! ! ! ! ! ! ! ! ! Do the DOT M A P R E D
  152. 152. Do the DOT
  153. 153. <3 Testing
  154. 154. class WordCountJobTest extends FlatSpec ! with ShouldMatchers with TupleConversions {! ! "WordCountJob" should "count words" in {! JobTest(new WordCountJob(_))! .arg("input", "inFile")! .arg("output", "outFile")! .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))! .sink[(String, Int)](Tsv("outFile")) { out =>! out.toList should contain ("kapi" -> 3)! }! .run! .finish! }! ! }! <3 Testing
  155. 155. class WordCountJobTest extends FlatSpec ! with ShouldMatchers with TupleConversions {! ! "WordCountJob" should "count words" in {! JobTest(new WordCountJob(_))! .arg("input", "inFile")! .arg("output", "outFile")! .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))! .sink[(String, Int)](Tsv("outFile")) { out =>! out.toList should contain ("kapi" -> 3)! }! .run! .finish! }! ! }! <3 Testing
  156. 156. class WordCountJobTest extends FlatSpec ! with ShouldMatchers with TupleConversions {! ! "WordCountJob" should "count words" in {! JobTest(new WordCountJob(_))! .arg("input", "inFile")! .arg("output", "outFile")! .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))! .sink[(String, Int)](Tsv("outFile")) { out =>! out.toList should contain ("kapi" -> 3)! }! .run! .finish! }! ! }! <3 Testing
  157. 157. class WordCountJobTest extends FlatSpec ! with ShouldMatchers with TupleConversions {! ! "WordCountJob" should "count words" in {! JobTest(new WordCountJob(_))! .arg("input", "inFile")! .arg("output", "outFile")! .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))! .sink[(String, Int)](Tsv("outFile")) { out =>! out.toList should contain ("kapi" -> 3)! }! .run! .finish! }! ! }! <3 Testing
  158. 158. class WordCountJobTest extends FlatSpec ! with ShouldMatchers with TupleConversions {! ! "WordCountJob" should "count words" in {! JobTest(new WordCountJob(_))! .arg("input", "inFile")! .arg("output", "outFile")! .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))! .sink[(String, Int)](Tsv("outFile")) { out =>! out.toList should contain ("kapi" -> 3)! }! .run! .finish! }! ! }! <3 Testing
  159. 159. class WordCountJobTest extends FlatSpec ! with ShouldMatchers with TupleConversions {! ! "WordCountJob" should "count words" in {! JobTest(new WordCountJob(_))! .arg("input", "inFile")! .arg("output", "outFile")! .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))! .sink[(String, Int)](Tsv("outFile")) { out =>! out.toList should contain ("kapi" -> 3)! }! .run! .finish! }! ! }! <3 Testing
  160. 160. class WordCountJobTest extends FlatSpec ! with ShouldMatchers with TupleConversions {! ! "WordCountJob" should "count words" in {! JobTest(new WordCountJob(_))! .arg("input", "inFile")! .arg("output", "outFile")! .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))! .sink[(String, Int)](Tsv("outFile")) { out =>! out.toList should contain ("kapi" -> 3)! }! .run! .finish! }! ! }! <3 Testing
  161. 161. class WordCountJobTest extends FlatSpec ! with ShouldMatchers with TupleConversions {! ! "WordCountJob" should "count words" in {! JobTest(new WordCountJob(_))! .arg("input", "inFile")! .arg("output", "outFile")! .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))! .sink[(String, Int)](Tsv("outFile")) { out =>! out.toList should contain ("kapi" -> 3)! }! .runHadoop! .finish! }! ! }! <3 Testing
  162. 162. class WordCountJobTest extends FlatSpec ! with ShouldMatchers with TupleConversions {! ! "WordCountJob" should "count words" in {! JobTest(new WordCountJob(_))! .arg("input", "inFile")! .arg("output", "outFile")! .source(TextLine("inFile"), List("0" -> "kapi kapi pi pu po"))! .sink[(String, Int)](Tsv("outFile")) { out =>! out.toList should contain ("kapi" -> 3)! }! .runHadoop! .finish! }! ! }! <3 Testing run || runHadoop
  163. 163. “Parallelize all the batches!”
  164. 164. “Parallelize all the batches!” Feels much like Scala collections
  165. 165. “Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading
  166. 166. “Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps
  167. 167. “Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to
  168. 168. “Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to Pure Scala
  169. 169. “Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to Pure Scala Testing friendly
  170. 170. “Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to Pure Scala Testing friendly
  171. 171. “Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to Pure Scala Testing friendly Matrix API
  172. 172. “Parallelize all the batches!” Feels much like Scala collections Local Mode thanks to Cascading Easy to add custom Taps Type Safe, when you want to Pure Scala Testing friendly Matrix API Efficient columnar storage (Parquet)
  173. 173. Scalding Re-Cap ! ! ! ! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! .write(Tsv(outputFile))! ! !
  174. 174. Scalding Re-Cap ! ! ! ! ! TextLine(inputFile)! .flatMap('line -> 'word) { line: String => tokenize(line) }! .groupBy('word) { _.size }! .write(Tsv(outputFile))! ! ! (the core is just 4 lines!)
  175. 175. ! ! ! ! ! $ activator new activator-scalding! ! Try it! http://typesafe.com/activator/template/activator-scalding Template by Dean Wampler
  176. 176. Loads Of Links 1. http://parleys.com/play/51c2e0f3e4b0ed877035684f/chapter0/about 2. https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/ReduceOperations.scala 3. http://www.slideshare.net/johnynek/scalding?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=4 4. http://www.slideshare.net/Hadoop_Summit/severs-june26-255pmroom210av2?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=3 5. http://www.slideshare.net/LivePersonDev/scalding-reaching-efficient-mapreduce?qid=6db1da40-121b-4547-8aa6-4fb051343d91&v=qf1&b=&from_search=2 6. http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/ 7. http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/ 8. https://engineering.twitter.com/university/videos/why-scalding-is-important-for-data-science 9. https://github.com/parquet/parquet-format 10. http://www.slideshare.net/ktoso/scalding-hadoop-word-count-in-less-than-60-lines-of-code 11. https://github.com/scalaz/scalaz 12. http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
  177. 177. ! Danke! Dzięki! Thanks! Gracias! ありがとう! ktoso @ typesafe.com t: ktosopl / g: ktoso blog: project13.pl
