Writing Hadoop Jobs in Scala using Scalding

Talk that I gave at #BcnDevCon13 about using Scalding and the strong points of using Scala for Big Data processing.

Transcript

  • 1. Writing Hadoop Jobs in Scala using Scalding (@tonicebrian)
  • 2. How much storage can $100 buy you?
  • 3. How much storage can $100 buy you? 1980: 1 photo
  • 4. How much storage can $100 buy you? 1980: 1 photo; 1990: 5 songs
  • 5. How much storage can $100 buy you? 1980: 1 photo; 1990: 5 songs; 2000: 7 movies
  • 6. How much storage can $100 buy you? 1980: 1 photo; 1990: 5 songs; 2000: 7 movies; 2010: 5 million photos, 170,000 songs, 600 movies
  • 7. From single drives…
  • 8. From single drives… to clusters…
  • 9. Data Science
  • 10. “A mathematician is a device for turning coffee into theorems” (Alfréd Rényi)
  • 11. “A data scientist is a device for turning coffee into theorems”
  • 12. “A data scientist is a device for turning coffee and data into theorems”
  • 13. “A data scientist is a device for turning coffee and data into insights”
  • 14. Hadoop = Map Reduce + Distributed File System
  • 15. Hadoop = Map Reduce + Distributed File System (Storage)
  • 16. Hadoop = Map Reduce (Program Model) + Distributed File System (Storage)
  • 17. Word Count. Raw input: "Hello cruel world" / "Say hello! Hello!"
  • 18. Word Count. Map: "Hello cruel world" emits (hello, 1), (cruel, 1), (world, 1); "Say hello! Hello!" emits (say, 1), (hello, 2)
  • 19. Word Count. Reduce: the emitted pairs are grouped by word, e.g. hello gathers [1, 2]
  • 20. Word Count. Result: hello 3, cruel 1, world 1, say 1
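        The same two phases can be sketched with ordinary Scala collections; a minimal illustration of the idea (plain Scala, not the Hadoop API):

        // Word count over an in-memory list, mirroring the slides above.
        val raw = List("Hello cruel world", "Say hello! Hello!")

        // Map phase: emit a (word, 1) pair for every token.
        val mapped: List[(String, Int)] =
          raw.flatMap(_.toLowerCase.split("\\W+")).filter(_.nonEmpty).map(w => (w, 1))

        // Reduce phase: group the pairs by word and sum the counts.
        val counts: Map[String, Int] =
          mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2).sum) }
        // counts == Map("hello" -> 3, "cruel" -> 1, "world" -> 1, "say" -> 1)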
  • 21. 4 Main Characteristics of Scala
  • 22. 4 Main Characteristics of Scala: JVM
  • 23. 4 Main Characteristics of Scala: JVM, Statically Typed
  • 24. 4 Main Characteristics of Scala: JVM, Statically Typed, Object Oriented
  • 25. 4 Main Characteristics of Scala: JVM, Statically Typed, Object Oriented, Functional Programming
  • 26. def map[B](f: (A) ⇒ B): List[B]
          Builds a new collection by applying a function to all elements of this list.
        def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
          Reduces the elements of this list using the specified associative binary operator.
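        Applied to a plain List, the two signatures look like this (a small illustrative snippet):

        // map transforms every element; reduce folds them with an associative operator.
        val lengths: List[Int] = List("hadoop", "scala", "scalding").map(_.length)
        // lengths == List(6, 5, 8)
        val total: Int = lengths.reduce(_ + _)
        // total == 19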
  • 27. Recap
  • 28. Recap. Map/Reduce: a programming paradigm that employs concepts from Functional Programming
  • 29. Recap. Map/Reduce: a programming paradigm that employs concepts from Functional Programming. Scala: a functional language that runs on the JVM
  • 30. Recap. Map/Reduce: a programming paradigm that employs concepts from Functional Programming. Scala: a functional language that runs on the JVM. Hadoop: an open-source implementation of MR on the JVM
  • 31. So in what language is Hadoop implemented?
  • 32. The Result?
  • 33. The Result?

        package org.myorg;

        import java.io.IOException;
        import java.util.*;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapreduce.*;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class WordCount {

          public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
              }
            }
          }

          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();
              }
              context.write(key, new IntWritable(sum));
            }
          }

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.waitForCompletion(true);
          }
        }
  • 34. High level approaches: SQL, Data Transformations
  • 35. High level approaches
        input_lines = LOAD 'myfile.txt' AS (line:chararray);
        words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
        filtered_words = FILTER words BY word MATCHES '\\w+';
        word_groups = GROUP filtered_words BY word;
        word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
        ordered_word_count = ORDER word_count BY count DESC;
        STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
  • 36. User defined functions (UDF)

        Pig:
        -- myscript.pig
        REGISTER myudfs.jar;
        A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
        B = FOREACH A GENERATE myudfs.UPPER(name);
        DUMP B;

        Java:
        package myudfs;

        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.Tuple;
        import org.apache.pig.impl.util.WrappedIOException;

        public class UPPER extends EvalFunc<String> {
          public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0)
              return null;
            try {
              String str = (String) input.get(0);
              return str.toUpperCase();
            } catch (Exception e) {
              throw WrappedIOException.wrap("Caught exception processing input row ", e);
            }
          }
        }
  • 37. WordCount in Cascading

        package impatient;

        import java.util.Properties;

        import cascading.flow.Flow;
        import cascading.flow.FlowDef;
        import cascading.flow.hadoop.HadoopFlowConnector;
        import cascading.operation.aggregator.Count;
        import cascading.operation.regex.RegexSplitGenerator;
        import cascading.pipe.Each;
        import cascading.pipe.Every;
        import cascading.pipe.GroupBy;
        import cascading.pipe.Pipe;
        import cascading.property.AppProps;
        import cascading.scheme.hadoop.TextDelimited;
        import cascading.tap.Tap;
        import cascading.tap.hadoop.Hfs;
        import cascading.tuple.Fields;

        public class Main {
          public static void main( String[] args ) {
            String docPath = args[ 0 ];
            String wcPath = args[ 1 ];

            Properties properties = new Properties();
            AppProps.setApplicationJarClass( properties, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

            // create source and sink taps
            Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
            Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

            // specify a regex operation to split the "document" text lines into a token stream
            Fields token = new Fields( "token" );
            Fields text = new Fields( "text" );
            RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
            // only returns "token"
            Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

            // determine the word counts
            Pipe wcPipe = new Pipe( "wc", docPipe );
            wcPipe = new GroupBy( wcPipe, token );
            wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef()
                .setName( "wc" )
                .addSource( docPipe, docTap )
                .addTailSink( wcPipe, wcTap );

            // write a DOT file and run the flow
            Flow wcFlow = flowConnector.connect( flowDef );
            wcFlow.writeDOT( "dot/wc.dot" );
            wcFlow.complete();
          }
        }
  • 38. Good parts • Data Flow Programming Model • User Defined Functions
  • 39. Good parts • Data Flow Programming Model • User Defined Functions Bad • Still Java • Objects for Flows
  • 40. package com.twitter.scalding.examples

        import com.twitter.scalding._

        class WordCountJob(args: Args) extends Job(args) {
          TextLine(args("input"))
            .flatMap('line -> 'word) { line: String => tokenize(line) }
            .groupBy('word) { _.size }
            .write(Tsv(args("output")))

          // Split a piece of text into individual words.
          def tokenize(text: String): Array[String] = {
            // Lowercase each word and remove punctuation.
            text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
          }
        }
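        For comparison, the same job can be written against Scalding's typed API, much as in the project's README; a sketch (TypedWordCountJob is an illustrative name):

        import com.twitter.scalding._

        // Word count using TypedPipe instead of the fields-based API.
        class TypedWordCountJob(args: Args) extends Job(args) {
          TypedPipe.from(TextLine(args("input")))
            .flatMap(_.toLowerCase.split("\\s+").toIterator)
            .groupBy(identity) // each word is its own key
            .size              // occurrences per key
            .write(TypedTsv[(String, Long)](args("output")))
        }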
  • 41. TDD Cycle: Red → Green → Refactor
  • 42. Broader view: the Red/Green/Refactor loop nests inside Unit Testing, Acceptance Testing, Continuous Deployment, and the Lean Startup cycle
  • 43. Big Data Big Speed
  • 44-51. A typical day working with Hadoop (a run of image-only slides)
  • 52. Is Scalding of any help here?
  • 53. Is Scalding of any help here? 0 Size of code
  • 54. Is Scalding of any help here? 0 Size of code 1 Types
  • 55. Is Scalding of any help here? 0 Size of code 1 Types 2 Unit Testing
  • 56. Is Scalding of any help here? 0 Size of code 1 Types 2 Unit Testing 3 Local execution
  • 57. 1 Types
  • 58. An extra cycle: Unit Testing, Acceptance Testing, Continuous Deployment, Lean Startup
  • 59. An extra cycle: a Compilation Phase added inside Unit Testing, Acceptance Testing, Continuous Deployment, Lean Startup
  • 60. Static typechecking makes you a better programmer™
  • 61. Fail-fast with type errors (Int,Int,Int,Int)
  • 62. Fail-fast with type errors (Int,Int,Int,Int) TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]
  • 63. Fail-fast with type errors: (Int,Int,Int,Int) vs TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]
        val w = 5
        val x = 5
        val y = 5
        val z = 5
        w + x + y + z == 20
  • 64. Fail-fast with type errors:
        val w = Meters(5)
        val x = Miles(5)
        val y = Celsius(5)
        val z = Fahrenheit(5)
        w + x + y + z // => type error
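        A minimal sketch of how such unit types can be built; Meters and Miles here are illustrative value classes, not a Scalding API:

        // Value classes wrap an Int at (mostly) zero runtime cost, and the
        // compiler rejects arithmetic that mixes two different units.
        case class Meters(value: Int) extends AnyVal {
          def +(other: Meters): Meters = Meters(value + other.value)
        }
        case class Miles(value: Int) extends AnyVal {
          def +(other: Miles): Miles = Miles(value + other.value)
        }

        val w = Meters(5)
        val x = Miles(5)
        // w + w   // fine: Meters(10)
        // w + x   // does not compile: type mismatch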
  • 65. 2 Unit Testing
  • 66. How do you test a distributed algorithm without a distributed platform?
  • 67-69. Source / Tap (diagram slides)
  • 70. // Scalding
        import org.specs._ // specs, the testing framework used in this example
        import com.twitter.scalding._

        class WordCountTest extends Specification with TupleConversions {
          "A WordCount job" should {
            JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
              arg("input", "inputFile").
              arg("output", "outputFile").
              source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
              sink[(String, Int)](Tsv("outputFile")) { outputBuffer =>
                val outMap = outputBuffer.toMap
                "count words correctly" in {
                  outMap("hack") must be_==(4)
                  outMap("and") must be_==(1)
                }
              }.
              run.
              finish
          }
        }
  • 71. 3 Local Execution
  • 72-73. HDFS / Local (diagram slides)
  • 74. SBT as a REPL
        > run-main com.twitter.scalding.Tool MyJob --local
        > run-main com.twitter.scalding.Tool MyJob --hdfs
  • 75. More Scalding goodness
  • 76. More Scalding goodness: Algebird
  • 77. More Scalding goodness: Algebird, Matrix library
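        Algebird packages algebraic structures (semigroups, monoids, and friends) that Scalding reuses for aggregations; a small example of its Map monoid, following Algebird's README:

        import com.twitter.algebird.Monoid

        // Maps combine value-wise under the Map monoid.
        val merged = Monoid.plus(Map("hack" -> 3, "and" -> 1), Map("hack" -> 1))
        // merged == Map("hack" -> 4, "and" -> 1)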