Writing Hadoop Jobs in Scala using Scalding

Talk that I gave at #BcnDevCon13 about using Scalding and the strong points of Scala for Big Data processing.



  1. Writing Hadoop Jobs in Scala using Scalding (@tonicebrian)
  2. How much storage can $100 buy you?
  3. How much storage can $100 buy you? 1 photo (1980)
  4. How much storage can $100 buy you? 1 photo (1980), 5 songs (1990)
  5. How much storage can $100 buy you? 1 photo (1980), 5 songs (1990), 7 movies (2000)
  6. How much storage can $100 buy you? 1 photo (1980), 5 songs (1990), 7 movies (2000), 5 million photos / 170,000 songs / 600 movies (2010)
  7. From single drives…
  8. From single drives… to clusters…
  9. Data Science
  10. “A mathematician is a device for turning coffee into theorems” (Alfréd Rényi)
  11. “A data scientist is a device for turning coffee into theorems”
  12. “A data scientist is a device for turning coffee and data into theorems”
  13. “A data scientist is a device for turning coffee and data into insights”
  14. Hadoop = Distributed File System + Map/Reduce
  15. Hadoop = Distributed File System (Storage) + Map/Reduce
  16. Hadoop = Distributed File System (Storage) + Map/Reduce (Program Model)
  17. Word Count. Raw input: “Hello cruel world” / “Say hello! Hello!”
  18. Word Count. Map: “Hello cruel world” → hello 1, cruel 1, world 1; “Say hello! Hello!” → say 1, hello 2
  19. Word Count. Reduce: the partial counts for each word are brought together, e.g. hello → (1, 2)
  20. Word Count. Result: hello 3, cruel 1, world 1, say 1
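A minimal sketch of those two phases on plain Scala collections (no Hadoop involved; the names are mine, for illustration only):

      // Word count expressed as the two Map/Reduce phases, in-memory.
      val raw = List("Hello cruel world", "Say hello! Hello!")

      // Map phase: emit (word, 1) for every token.
      val mapped: List[(String, Int)] =
        raw.flatMap(_.toLowerCase.split("\\W+")).filter(_.nonEmpty).map(word => (word, 1))

      // Reduce phase: group by key and sum the counts per word.
      val counts: Map[String, Int] =
        mapped.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
      // counts == Map("hello" -> 3, "cruel" -> 1, "world" -> 1, "say" -> 1)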
  21. 4 Main Characteristics of Scala
  22. 4 Main Characteristics of Scala: JVM
  23. 4 Main Characteristics of Scala: JVM, Statically Typed
  24. 4 Main Characteristics of Scala: JVM, Statically Typed, Object Oriented
  25. 4 Main Characteristics of Scala: JVM, Statically Typed, Object Oriented, Functional Programming
  26. def map[B](f: (A) ⇒ B): List[B]
        Builds a new collection by applying a function to all elements of this list.
      def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
        Reduces the elements of this list using the specified associative binary operator.
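These are the standard Scala collection methods; a quick concrete use, just to ground the signatures:

      val xs = List(1, 2, 3, 4)

      val doubled = xs.map(x => x * 2) // List(2, 4, 6, 8)
      val total   = xs.reduce(_ + _)   // 10; + is associative, as reduce requires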
  27. Recap
  28. Recap. Map/Reduce: programming paradigm that employs concepts from Functional Programming.
  29. Recap. Map/Reduce: programming paradigm that employs concepts from Functional Programming. Scala: functional language that runs on the JVM.
  30. Recap. Map/Reduce: programming paradigm that employs concepts from Functional Programming. Scala: functional language that runs on the JVM. Hadoop: open-source implementation of MR on the JVM.
  31. So in what language is Hadoop implemented?
  32. The Result?
  33. The Result?

      package org.myorg;

      import java.io.IOException;
      import java.util.*;

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapreduce.*;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

      public class WordCount {

        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);
            }
          }
        }

        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            context.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "wordcount");

          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          job.setMapperClass(Map.class);
          job.setReducerClass(Reduce.class);

          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          job.waitForCompletion(true);
        }
      }
  34. High-level approaches: SQL, Data Transformations
  35. High-level approaches (Pig):

      input_lines = LOAD 'myfile.txt' AS (line:chararray);
      words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
      filtered_words = FILTER words BY word MATCHES '\w+';
      word_groups = GROUP filtered_words BY word;
      word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
      ordered_word_count = ORDER word_count BY count DESC;
      STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
  36. User defined functions (UDF)

      Pig:
        -- myscript.pig
        REGISTER myudfs.jar;
        A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
        B = FOREACH A GENERATE myudfs.UPPER(name);
        DUMP B;

      Java:
        package myudfs;

        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.Tuple;
        import org.apache.pig.impl.util.WrappedIOException;

        public class UPPER extends EvalFunc<String> {
          public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0)
              return null;
            try {
              String str = (String) input.get(0);
              return str.toUpperCase();
            } catch (Exception e) {
              throw WrappedIOException.wrap("Caught exception processing input row ", e);
            }
          }
        }
  37. WordCount in Cascading

      package impatient;

      import java.util.Properties;

      import cascading.flow.Flow;
      import cascading.flow.FlowDef;
      import cascading.flow.hadoop.HadoopFlowConnector;
      import cascading.operation.aggregator.Count;
      import cascading.operation.regex.RegexFilter;
      import cascading.operation.regex.RegexSplitGenerator;
      import cascading.pipe.Each;
      import cascading.pipe.Every;
      import cascading.pipe.GroupBy;
      import cascading.pipe.Pipe;
      import cascading.property.AppProps;
      import cascading.scheme.Scheme;
      import cascading.scheme.hadoop.TextDelimited;
      import cascading.tap.Tap;
      import cascading.tap.hadoop.Hfs;
      import cascading.tuple.Fields;

      public class Main {
        public static void main( String[] args ) {
          String docPath = args[ 0 ];
          String wcPath = args[ 1 ];

          Properties properties = new Properties();
          AppProps.setApplicationJarClass( properties, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

          // create source and sink taps
          Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
          Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

          // specify a regex operation to split the "document" text lines into a token stream
          Fields token = new Fields( "token" );
          Fields text = new Fields( "text" );
          RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
          // only returns "token"
          Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

          // determine the word counts
          Pipe wcPipe = new Pipe( "wc", docPipe );
          wcPipe = new GroupBy( wcPipe, token );
          wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef()
            .setName( "wc" )
            .addSource( docPipe, docTap )
            .addTailSink( wcPipe, wcTap );

          // write a DOT file and run the flow
          Flow wcFlow = flowConnector.connect( flowDef );
          wcFlow.writeDOT( "dot/wc.dot" );
          wcFlow.complete();
        }
      }
  38. Good parts • Data Flow Programming Model • User Defined Functions
  39. Good parts • Data Flow Programming Model • User Defined Functions | Bad • Still Java • Objects for Flows
  40. WordCount in Scalding:

      package com.twitter.scalding.examples

      import com.twitter.scalding._

      class WordCountJob(args : Args) extends Job(args) {
        TextLine( args("input") )
          .flatMap('line -> 'word) { line : String => tokenize(line) }
          .groupBy('word) { _.size }
          .write( Tsv( args("output") ) )

        // Split a piece of text into individual words.
        def tokenize(text : String) : Array[String] = {
          // Lowercase each word and remove punctuation.
          text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
        }
      }
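For comparison, a sketch of the same job against Scalding's typed API (TypedPipe), which is where the type-safety argument later in the talk comes from; this variant is mine, not from the deck, and assumes the same com.twitter.scalding._ import:

      class TypedWordCountJob(args : Args) extends Job(args) {
        // TypedPipe[String]: one element per input line.
        TypedPipe.from(TextLine(args("input")))
          .flatMap(_.toLowerCase.split("\\s+"))
          .filter(_.nonEmpty)
          .groupBy(identity)  // group equal words together
          .size               // (word, count) pairs
          .write(TypedTsv[(String, Long)](args("output")))
      }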
  41. TDD Cycle: Red → Green → Refactor
  42. Broader view: the Red/Green/Refactor loop sits inside larger cycles of Unit Testing, Acceptance Testing, Continuous Deployment, and the Lean Startup
  43. Big Data, Big Speed
  44.–51. A typical day working with Hadoop (a sequence of image-only slides)
  52. Is Scalding of any help here?
  53. Is Scalding of any help here? 0. Size of code
  54. Is Scalding of any help here? 0. Size of code 1. Types
  55. Is Scalding of any help here? 0. Size of code 1. Types 2. Unit Testing
  56. Is Scalding of any help here? 0. Size of code 1. Types 2. Unit Testing 3. Local execution
  57. 1. Types
  58. An extra cycle: Unit Testing, Acceptance Testing, Continuous Deployment, Lean Startup
  59. An extra cycle: a Compilation Phase added inside Unit Testing, Acceptance Testing, Continuous Deployment, Lean Startup
  60. Static typechecking makes you a better programmer™
  61. Fail-fast with type errors: (Int, Int, Int, Int)
  62. Fail-fast with type errors: (Int, Int, Int, Int) vs TypedPipe[(Meters, Miles, Celsius, Fahrenheit)]
  63. Fail-fast with type errors. With bare Ints:

      val w = 5
      val x = 5
      val y = 5
      val z = 5

      w + x + y + z == 20

  64. Fail-fast with type errors. With unit types:

      val w = Meters(5)
      val x = Miles(5)
      val y = Celsius(5)
      val z = Fahrenheit(5)

      w + x + y + z // => type error
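A minimal sketch of how such unit types can be declared in plain Scala; the Meters/Miles/Celsius/Fahrenheit names come from the slides, but this value-class encoding is my assumption, not necessarily the speaker's:

      // Wrapping raw Ints in distinct types turns mixed-unit arithmetic into a compile error.
      case class Meters(value: Int) extends AnyVal {
        def +(other: Meters): Meters = Meters(value + other.value)
      }
      case class Miles(value: Int) extends AnyVal
      case class Celsius(value: Int) extends AnyVal
      case class Fahrenheit(value: Int) extends AnyVal

      val w = Meters(5)
      val x = Miles(5)

      // w + x              // does not compile: Meters.+ expects Meters, not Miles
      val ok = w + Meters(2) // Meters(7)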
  65. 2. Unit Testing
  66. How do you test a distributed algorithm without a distributed platform?
  67.–69. Source … Tap (image-only slide sequence)
  70. // Scalding
      import com.twitter.scalding._
      import org.specs._ // specs provides Specification (import implied by the original example)

      class WordCountTest extends Specification with TupleConversions {
        "A WordCount job" should {
          JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
            arg("input", "inputFile").
            arg("output", "outputFile").
            source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
            sink[(String,Int)](Tsv("outputFile")){ outputBuffer =>
              val outMap = outputBuffer.toMap
              "count words correctly" in {
                outMap("hack") must be_==(4)
                outMap("and") must be_==(1)
              }
            }.
            run.
            finish
        }
      }
  71. 3. Local Execution
  72.–73. HDFS … Local (image-only slide sequence)
  74. SBT as a REPL

      > run-main com.twitter.scalding.Tool MyJob --local
      > run-main com.twitter.scalding.Tool MyJob --hdfs
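With --local, Scalding's standard com.twitter.scalding.Tool runner plans the job on Cascading's in-memory local mode, so it runs against ordinary files with no cluster; --hdfs plans the very same job for Hadoop.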
  75. More Scalding goodness
  76. More Scalding goodness: Algebird
  77. More Scalding goodness: Algebird, Matrix library
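The deck closes on these titles alone. As a small taste of Algebird (example mine, based on Algebird's documented Operators import, not on the talk): its Monoid-derived + combines nested aggregates, which is exactly the shape of partial results in a Map/Reduce job:

      import com.twitter.algebird.Operators._

      // Monoid addition works element-wise on maps, summing the counts per key.
      val partial1 = Map("hello" -> 2, "world" -> 1)
      val partial2 = Map("hello" -> 1, "say" -> 1)

      partial1 + partial2
      // => Map("hello" -> 3, "world" -> 1, "say" -> 1)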
