Scalding - Hadoop Word Count in LESS than 70 lines of code


Twitter's Scalding is built on top of Cascading, which in turn is built on top of Hadoop. It is essentially a pleasant-to-read, easy-to-extend DSL for writing MapReduce jobs.

Transcript of "Scalding - Hadoop Word Count in LESS than 70 lines of code"

1. Scalding: Hadoop Word Count in < 70 lines of code. Konrad "ktoso" Malawski, JARCamp #3, 12.04.2013
2. Scalding: Hadoop Word Count in 4 lines of code. Konrad "ktoso" Malawski, JARCamp #3, 12.04.2013
3. softwaremill.com / java.pl / sckrk.com / geecon.org / krakowscala.pl / gdgkrakow.pl
4.-13. Agenda: Why Scalding? (10%) + Hadoop Basics (20%) + Enter Cascading (40%) + Hello Scalding (30%) = 100%
14. Why Scalding? Word Count in Types:

    type Word = String
    type Count = Int

    String => Map[Word, Count]
15.-23. Why Scalding? Word Count in Scala:

    val text = "a a a b b"

    def wordCount(text: String): Map[Word, Count] =
      text
        .split(" ")
        .map(a => (a, 1))
        .groupBy(_._1)
        .map { a => a._1 -> a._2.map(_._2).sum }

    wordCount(text) should equal (Map("a" -> 3, "b" -> 2))
24.-29. Stuff > Memory. Scala collections... fun, but memory bound!

    val text = "so many words... waaah! ..."   // in Memory
    text                                       // in Memory
      .split(" ")                              // in Memory
      .map(a => (a, 1))                        // in Memory
      .groupBy(_._1)                           // in Memory
      .map(a => (a._1, a._2.map(_._2).sum))    // in Memory
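
Even swapping the List for a lazy iterator does not help here: groupBy must materialize every group before the final map can run. A minimal sketch illustrating the problem (the file name is made up):

    import scala.io.Source

    // lines are streamed lazily...
    val lines = Source.fromFile("huge.txt").getLines()

    // ...but groupBy needs the whole collection, so everything
    // is forced back into heap and dies once data outgrows memory
    val counts = lines
      .flatMap(_.split(" "))
      .toList
      .groupBy(identity)
      .map { case (w, ws) => w -> ws.size }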
30. Apache Hadoop (HDFS + MR): http://hadoop.apache.org/
31.-32. Why Scalding? Word Count in Hadoop MR:

    package org.myorg;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    public class WordCount {
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }
33. Trivia: How old is Hadoop?
34.-35. Cascading: www.cascading.org/
36.-38. Cascading is Taps & Pipes & Sinks
39.-47. 1: Distributed Copy

    // source Tap
    Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);

    // sink Tap
    Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

    // a Pipe connects taps
    Pipe copyPipe = new Pipe("copy");

    // build the Flow
    FlowDef flowDef = FlowDef.flowDef()
      .addSource(copyPipe, inTap)
      .addTailSink(copyPipe, outTap);

    // run!
    flowConnector.connect(flowDef).complete();
48.-53. 1. DCP - Full Code

    public class Main {
      public static void main(String[] args) {
        String inPath = args[0];
        String outPath = args[1];

        Properties props = new Properties();
        AppProps.setApplicationJarClass(props, Main.class);
        HadoopFlowConnector flowConnector = new HadoopFlowConnector(props);

        Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);
        Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

        Pipe copyPipe = new Pipe("copy");

        FlowDef flowDef = FlowDef.flowDef()
          .addSource(copyPipe, inTap)
          .addTailSink(copyPipe, outTap);

        flowConnector.connect(flowDef).complete();
      }
    }
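
Running it means packaging the app and submitting it to Hadoop, roughly like this (the jar name is hypothetical, and this assumes the jar's manifest points at Main):

    hadoop jar target/dcp.jar /data/input /data/output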
54.-66. 2: Word Count

    String docPath = args[0];
    String wcPath = args[1];

    Properties props = new Properties();
    AppProps.setApplicationJarClass(props, Main.class);
    HadoopFlowConnector flowConnector = new HadoopFlowConnector(props);

    // create source and sink taps
    Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
    Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields("token");
    Fields text = new Fields("text");
    RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");

    // only returns "token"
    Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

    // determine the word counts
    Pipe wcPipe = new Pipe("wc", docPipe);
    wcPipe = new GroupBy(wcPipe, token);
    wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName("wc")
      .addSource(docPipe, docTap)
      .addTailSink(wcPipe, wcTap);

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect(flowDef);
    wcFlow.writeDOT("dot/wc.dot");
    wcFlow.complete();
67.-69. 2: Word Count. How it's made: the FlowDef gets compiled into a graph representation of jobs! See http://www.cascading.org/2012/07/09/cascading-for-the-impatient-part-2/
70.-76. How it's made:

    val flow = FlowDef

    // pseudo code...
    val jobs: List[MRJob] = flowConnector(flow)

    // pseudo code...
    HadoopCluster.execute(jobs)
77.-79. Cascading tips

    Pipe assembly = new Pipe("assembly");
    assembly = new Each(assembly, DebugLevel.VERBOSE, new Debug());
    // ...

    // head and tail have the same name
    FlowDef flowDef = new FlowDef()
      .setName("debug")
      .addSource("assembly", source)
      .addSink("assembly", sink)
      .addTail(assembly);

    // with this set, the flowConnector will NOT create the Debug pipe!
    flowDef.setDebugLevel(DebugLevel.NONE);
80. Scalding = Scala + Cascading. Twitter Scalding: github.com/twitter/scalding
81. Scalding API
82.-92. map

    Scala:
      val data = 1 :: 2 :: 3 :: Nil
      val doubled = data map { _ * 2 }   // Int => Int

    Scalding:
      IterableSource(data)
        .map('number -> 'doubled) { n: Int => n * 2 }   // Int => Int
      // 'number stays in the Pipe, 'doubled becomes available in the Pipe
      // note: you must choose the argument's type explicitly!
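
For context, here is a minimal sketch of a complete job wrapping the snippet above; the class name and output handling are invented for illustration (IterableSource is convenient for experiments and tests):

    import com.twitter.scalding._

    class DoubleJob(args: Args) extends Job(args) {
      // small in-memory source with a single field, 'number
      IterableSource(List(1, 2, 3), 'number)
        .map('number -> 'doubled) { n: Int => n * 2 }
        .write(Tsv(args("output")))
    }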
93.-104. mapTo

    Scala:
      var data = 1 :: 2 :: 3 :: Nil
      val doubled = data map { _ * 2 }   // Int => Int
      data = null                        // release the reference

    Scalding:
      IterableSource(data)
        .mapTo('doubled) { n: Int => n * 2 }   // Int => Int
      // 'number is removed, only 'doubled stays in the Pipe
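
mapTo is effectively a map fused with a projection; a hedged sketch of the two-step equivalent, mirroring the slide's shorthand source:

    IterableSource(data)
      .map('number -> 'doubled) { n: Int => n * 2 }
      .project('doubled)   // drop every field except 'doubled

Dropping fields you no longer need keeps less data flowing through the rest of the pipe.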
105.-116. flatMap

    Scala:
      val data = "1" :: "2,2" :: "3,3,3" :: Nil   // List[String]

      val numbers = data flatMap { line =>   // String
        line.split(",")                      // Array[String]
      } map { _.toInt }                      // List[Int]

      numbers should equal (List(1, 2, 2, 3, 3, 3))

    Scalding:
      TextLine(data)                                // like List[String]
        .flatMap('line -> 'word) { _.split(",") }   // like List[String]
        .map('word -> 'number) { _.toInt }          // like List[Int]
      // here the toInt map runs as a separate pipe operation ("MR map outside")
117.-128. flatMap

    Scala:
      val data = "1" :: "2,2" :: "3,3,3" :: Nil   // List[String]

      val numbers = data flatMap { line =>   // String
        line.split(",").map(_.toInt)         // Array[Int]
      }

      numbers should equal (List(1, 2, 2, 3, 3, 3))

    Scalding:
      TextLine(data)                                            // like List[String]
        .flatMap('line -> 'word) { _.split(",").map(_.toInt) }  // like List[Int]
      // here the toInt map runs inside the flatMap function, in plain Scala
129.-140. groupBy

    Scala:
      val data = 1 :: 2 :: 30 :: 42 :: Nil   // List[Int]

      val groups = data groupBy { _ < 10 }
      groups   // Map[Boolean, List[Int]]
      groups(true) should equal (List(1, 2))
      groups(false) should equal (List(30, 42))

    Scalding:
      IterableSource(List(1, 2, 30, 42), 'num)
        .map('num -> 'lessThanTen) { i: Int => i < 10 }
        .groupBy('lessThanTen) { _.size('size) }
      // groups all rows with == value, then takes each group's size
141.-145. groupBy

    Scalding:
      IterableSource(List(1, 2, 30, 42), 'num)
        .map('num -> 'lessThanTen) { i: Int => i < 10 }
        .groupBy('lessThanTen) { _.sum('total) }
      // total = [3, 72]
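
The GroupBuilder offers more than size and sum; the same totals could be computed with an explicit reduce (a sketch, assuming the same fields as above):

    IterableSource(List(1, 2, 30, 42), 'num)
      .map('num -> 'lessThanTen) { i: Int => i < 10 }
      .groupBy('lessThanTen) { _.reduce('num -> 'total) { (a: Int, b: Int) => a + b } }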
146.-157. Scalding API (a few of these are sketched below):

    project / discard
    map / mapTo
    flatMap / flatMapTo
    rename
    filter
    unique
    groupBy / groupAll / groupRandom / shuffle
    limit
    debug
    Group operations
    joins
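
A hedged sketch of several of those operations chained in one hypothetical job (the class, field, and file names are invented for illustration):

    import com.twitter.scalding._

    class OpsDemoJob(args: Args) extends Job(args) {
      Tsv(args("input"), ('id, 'name, 'score))
        .rename('score -> 'points)                  // give the field a new name
        .filter('name) { n: String => n.nonEmpty }  // drop rows with an empty name
        .project('name, 'points)                    // keep just these two fields
        .unique('name, 'points)                     // de-duplicate identical rows
        .write(Tsv(args("output")))
    }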
158.-161. Distributed Copy in Scalding

    class WordCountJob(args: Args) extends Job(args) {
      val input  = Tsv(args("input"))
      val output = Tsv(args("output"))

      input.read.write(output)
    }

    The End.
162.-163. Main Class - "Runner"

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.util.ToolRunner
    import com.twitter.scalding

    object ScaldingJobRunner extends App {
      // args comes from the App trait
      ToolRunner.run(new Configuration, new scalding.Tool, args)
    }
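
Launching a job through this runner might look as follows, mirroring the run command shown on a later slide (the --local flag and the in/out paths are assumptions):

    run pl.project13.scala.oculus.job.WordCountJob --local --input in.txt --output out.tsv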
164.-172. Word Count in Scalding

    class WordCountJob(args: Args) extends Job(args) {
      val inputFile = args("input")
      val outputFile = args("output")

      TextLine(inputFile)
        .flatMap('line -> 'word) { line: String => tokenize(line) }
        .groupBy('word) { _.size }
        .write(Tsv(outputFile))

      def tokenize(text: String): Array[String] = implemented
    }
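
The slide leaves tokenize as "implemented"; a minimal sketch of one reasonable implementation:

    // lower-case, strip punctuation, split on whitespace
    def tokenize(text: String): Array[String] =
      text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")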
173.-177. Word Count in Scalding

    run pl.project13.scala.oculus.job.WordCountJob --tool.graph
    => pl.project13.scala.oculus.job.WordCountJob0.dot

    The generated graph shows the MAP and RED(UCE) steps of the planned job.
178.-180. Word Count in Scalding

    TextLine(inputFile)
      .flatMap('line -> 'word) { line: String => tokenize(line) }
      .groupBy('word) { _.size('count) }
      .write(Tsv(outputFile))
181.-184. Why Scalding? Hadoop inside, Cascading abstractions, Scala conciseness.
185. Ask Stuff! Dzięki! Thanks! ありがとう! Konrad Malawski @ java.pl, t: ktosopl / g: ktoso / b: blog.project13.pl