Writing Hadoop Jobs in Scala using Scalding
 

Talk that I gave at #BcnDevCon13 about using Scalding and the strong points of using Scala for Big Data processing.

Writing Hadoop Jobs in Scala using Scalding Presentation Transcript

  • 1. Writing Hadoop Jobs in Scala using Scalding (@tonicebrian)
  • 2. How much storage can $100 buy you?
  • 3. How much storage can $100 buy you? 1980: 1 photo
  • 4. How much storage can $100 buy you? 1980: 1 photo · 1990: 5 songs
  • 5. How much storage can $100 buy you? 1980: 1 photo · 1990: 5 songs · 2000: 7 movies
  • 6. How much storage can $100 buy you? 1980: 1 photo · 1990: 5 songs · 2000: 7 movies · 2010: 5 million photos / 170,000 songs / 600 movies
  • 7. From single drives…
  • 8. From single drives… to clusters…
  • 9. Data Science
  • 10. “A mathematician is a device for turning coffee into theorems” – Alfréd Rényi
  • 11. “A data scientist is a device for turning coffee into theorems”
  • 12. “A data scientist is a device for turning coffee and data into theorems”
  • 13. “A data scientist is a device for turning coffee and data into insights”
  • 14. Hadoop = Map/Reduce + Distributed File System
  • 15. Hadoop = Map/Reduce + Distributed File System (Storage)
  • 16. Hadoop = Map/Reduce (Programming Model) + Distributed File System (Storage)
  • 17. Word Count. Raw input: "Hello cruel world" / "Say hello! Hello!"
  • 18. Word Count. Map: "Hello cruel world" → hello 1, cruel 1, world 1; "Say hello! Hello!" → say 1, hello 2
  • 19. Word Count. Reduce: hello → [1, 2], cruel → [1], world → [1], say → [1]
  • 20. Word Count. Result: hello 3, cruel 1, world 1, say 1
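
    The same walkthrough fits in a few lines of plain Scala collections; this is a local sketch only (no Hadoop involved), but it previews the map / group / reduce shape used throughout the talk:

        // Local word count over the two example lines, mirroring the Map and Reduce steps above.
        object LocalWordCount {
          val raw = List("Hello cruel world", "Say hello! Hello!")

          // "Map" phase: normalise each line and emit (word, 1) pairs.
          val mapped: List[(String, Int)] =
            raw.flatMap(_.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+"))
               .map(word => (word, 1))

          // "Reduce" phase: group the pairs by word and sum the counts.
          val counts: Map[String, Int] =
            mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }

          def main(args: Array[String]): Unit =
            counts.foreach { case (word, n) => println(s"$word $n") }  // hello 3, cruel 1, world 1, say 1
        }
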
  • 21. 4 Main Characteristics of Scala
  • 22. 4 Main Characteristics of Scala JVM
  • 23. 4 Main Characteristics of Scala JVM Statically Typed
  • 24. 4 Main Characteristics of Scala JVM Object Oriented Statically Typed
  • 25. 4 Main Characteristics of Scala JVM Statically Typed Object Oriented Functional Programming
  • 26. def map[B](f: (A) ⇒ B): List[B] Builds a new collection by applying a function to all elements of this list. def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1 Reduces the elements of this list using the specified associative binary operator.
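
    For reference, this is what those two signatures look like in use on an ordinary List (standard library only, nothing Hadoop-specific):

        // map: build a new List by applying a function to every element.
        val lengths: List[Int] = List("hello", "cruel", "world").map(_.length)  // List(5, 5, 5)

        // reduce: combine the elements with an associative binary operator.
        val total: Int = lengths.reduce(_ + _)                                  // 15
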
  • 27. Recap
  • 28. Recap. Map/Reduce: programming paradigm that employs concepts from Functional Programming
  • 29. Recap. Map/Reduce: programming paradigm that employs concepts from Functional Programming · Scala: Map/Reduce; Functional Language that runs on the JVM
  • 30. Recap. Map/Reduce: programming paradigm that employs concepts from Functional Programming · Scala: Map/Reduce; Functional Language that runs on the JVM · Hadoop: Open Source Implementation of MR in the JVM
  • 31. So in what language is Hadoop implemented?
  • 32. The Result?
  • 33. The Result?

        package org.myorg;

        import java.io.IOException;
        import java.util.*;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapreduce.*;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class WordCount {

          public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
              }
            }
          }

          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();
              }
              context.write(key, new IntWritable(sum));
            }
          }

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.waitForCompletion(true);
          }
        }
  • 34. High-level approaches: SQL / Data Transformations
  • 35. High-level approaches

        input_lines = LOAD 'myfile.txt' AS (line:chararray);
        words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
        filtered_words = FILTER words BY word MATCHES '\\w+';
        word_groups = GROUP filtered_words BY word;
        word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
        ordered_word_count = ORDER word_count BY count DESC;
        STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
  • 36. User defined functions (UDF)

        Pig (myscript.pig):

        -- myscript.pig
        REGISTER myudfs.jar;
        A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
        B = FOREACH A GENERATE myudfs.UPPER(name);
        DUMP B;

        Java (myudfs.UPPER):

        package myudfs;

        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.Tuple;
        import org.apache.pig.impl.util.WrappedIOException;

        public class UPPER extends EvalFunc<String> {
          public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0)
              return null;
            try {
              String str = (String) input.get(0);
              return str.toUpperCase();
            } catch (Exception e) {
              throw WrappedIOException.wrap("Caught exception processing input row ", e);
            }
          }
        }
  • 37. WordCount in Cascading

        package impatient;

        import java.util.Properties;

        import cascading.flow.Flow;
        import cascading.flow.FlowDef;
        import cascading.flow.hadoop.HadoopFlowConnector;
        import cascading.operation.aggregator.Count;
        import cascading.operation.regex.RegexFilter;
        import cascading.operation.regex.RegexSplitGenerator;
        import cascading.pipe.Each;
        import cascading.pipe.Every;
        import cascading.pipe.GroupBy;
        import cascading.pipe.Pipe;
        import cascading.property.AppProps;
        import cascading.scheme.Scheme;
        import cascading.scheme.hadoop.TextDelimited;
        import cascading.tap.Tap;
        import cascading.tap.hadoop.Hfs;
        import cascading.tuple.Fields;

        public class Main {
          public static void main( String[] args ) {
            String docPath = args[ 0 ];
            String wcPath = args[ 1 ];

            Properties properties = new Properties();
            AppProps.setApplicationJarClass( properties, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

            // create source and sink taps
            Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
            Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

            // specify a regex operation to split the "document" text lines into a token stream
            Fields token = new Fields( "token" );
            Fields text = new Fields( "text" );
            RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
            // only returns "token"
            Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

            // determine the word counts
            Pipe wcPipe = new Pipe( "wc", docPipe );
            wcPipe = new GroupBy( wcPipe, token );
            wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef()
              .setName( "wc" )
              .addSource( docPipe, docTap )
              .addTailSink( wcPipe, wcTap );

            // write a DOT file and run the flow
            Flow wcFlow = flowConnector.connect( flowDef );
            wcFlow.writeDOT( "dot/wc.dot" );
            wcFlow.complete();
          }
        }
  • 38. Good parts • Data Flow Programming Model • User Defined Functions
  • 39. Good parts • Data Flow Programming Model • User Defined Functions Bad • Still Java • Objects for Flows
  • 40. WordCount in Scalding

        package com.twitter.scalding.examples

        import com.twitter.scalding._

        class WordCountJob(args : Args) extends Job(args) {
          TextLine( args("input") )
            .flatMap('line -> 'word) { line : String => tokenize(line) }
            .groupBy('word) { _.size }
            .write( Tsv( args("output") ) )

          // Split a piece of text into individual words.
          def tokenize(text : String) : Array[String] = {
            // Lowercase each word and remove punctuation.
            text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
          }
        }
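
    For comparison, a sketch of roughly the same job written against Scalding's typed (TypedPipe) API, along the lines of the word-count example in the Scalding documentation:

        import com.twitter.scalding._

        // Typed variant: a TypedPipe[String] of lines, grouped by word, counted per group.
        class TypedWordCountJob(args: Args) extends Job(args) {
          TypedPipe.from(TextLine(args("input")))
            .flatMap(line => line.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+"))
            .groupBy(word => word)
            .size
            .write(TypedTsv[(String, Long)](args("output")))
        }
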
  • 41. TDD Cycle: Red → Green → Refactor
  • 42. Broader view: the Red/Green/Refactor loop sits inside Unit Testing, Acceptance Testing, Continuous Deployment, … and the Lean Startup loop
  • 43. Big Data Big Speed
  • 44-51. A typical day working with Hadoop (image slides)
  • 52. Is Scalding of any help here?
  • 53. Is Scalding of any help here? 0 Size of code
  • 54. Is Scalding of any help here? 0 Size of code 1 Types
  • 55. Is Scalding of any help here? 0 Size of code 1 Types 2 Unit Testing
  • 56. Is Scalding of any help here? 0 Size of code 1 Types 2 Unit Testing 3 Local execution
  • 57. 1 Types
  • 58. An extra cycle? Unit Testing, Acceptance Testing, Continuous Deployment, Lean Startup
  • 59. An extra cycle: the Compilation Phase, inside Unit Testing, Acceptance Testing, Continuous Deployment and Lean Startup
  • 60. Static typechecking makes you a better programmer™
  • 61. Fail-fast with type errors (Int,Int,Int,Int)
  • 62. Fail-fast with type errors (Int,Int,Int,Int) TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]
  • 63. Fail-fast with type errors:
        val w = 5
        val x = 5
        val y = 5
        val z = 5
        w + x + y + z == 20
  • 64. Fail-fast with type errors:
        val w = Meters(5)
        val x = Miles(5)
        val y = Celsius(5)
        val z = Fahrenheit(5)
        w + x + y + z   // => type error
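
    A minimal, self-contained sketch of the idea; the wrapper types below are illustrative (they are not a Scalding or Algebird API):

        // Hypothetical unit types: wrapping the Ints makes mixed-unit arithmetic a compile error.
        case class Meters(value: Int)
        case class Miles(value: Int)
        case class Celsius(value: Int)
        case class Fahrenheit(value: Int)

        object UnitsDemo {
          val w = Meters(5)
          val x = Miles(5)
          val y = Celsius(5)
          val z = Fahrenheit(5)

          // With plain Ints, 5 + 5 + 5 + 5 silently evaluates to 20 even though the units
          // are incompatible. With distinct wrapper types the same expression is rejected
          // at compile time:
          // w + x + y + z   // does not compile: Meters has no `+` taking Miles
        }
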
  • 65. 2 Unit Testing
  • 66. How do you test a distributed algorithm without a distributed platform?
  • 67-69. Source / Tap (image slides)
  • 70. Testing a Scalding job

        // Scalding
        import com.twitter.scalding._

        class WordCountTest extends Specification with TupleConversions {
          "A WordCount job" should {
            JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
              arg("input", "inputFile").
              arg("output", "outputFile").
              source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
              sink[(String,Int)](Tsv("outputFile")) { outputBuffer =>
                val outMap = outputBuffer.toMap
                "count words correctly" in {
                  outMap("hack") must be_==(4)
                  outMap("and") must be_==(1)
                }
              }.
              run.
              finish
          }
        }
  • 71. 3 Local Execution
  • 72-73. HDFS vs Local (image slides)
  • 74. SBT as a REPL
        > run-main com.twitter.scalding.Tool MyJob --local
        > run-main com.twitter.scalding.Tool MyJob --hdfs
  • 75. More Scalding goodness
  • 76. More Scalding goodness Algebird
  • 77. More Scalding goodness Algebird Matrix library
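
    Algebird is the abstract-algebra library behind many Scalding aggregations: anything with a Semigroup or Monoid can be summed, whether it is a number, a map, or a sketch structure. A minimal standalone sketch (the value names are illustrative), following the map-summing example in the Algebird README:

        import com.twitter.algebird._
        import com.twitter.algebird.Operators._  // adds `+` for any type with a Semigroup/Monoid

        object AlgebirdDemo {
          def main(args: Array[String]): Unit = {
            // Map[K, V] has a Monoid whenever V does, so maps can be summed value-wise.
            val clicksA = Map("home" -> 3, "search" -> 1)
            val clicksB = Map("home" -> 2, "checkout" -> 5)
            println(clicksA + clicksB)  // sums value-wise: home -> 5, search -> 1, checkout -> 5
          }
        }
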