Writing Hadoop Jobs in Scala using Scalding


Talk that I gave at #BcnDevCon13 about using Scalding and the strong points of using Scala for Big Data processing.

Transcript of "Writing Hadoop Jobs in Scala using Scalding"

  1. Writing Hadoop Jobs in Scala using Scalding (@tonicebrian)
  2. How much storage can $100 buy you?
  3. How much storage can $100 buy you? 1980: 1 photo
  4. How much storage can $100 buy you? 1980: 1 photo. 1990: 5 songs
  5. How much storage can $100 buy you? 1980: 1 photo. 1990: 5 songs. 2000: 7 movies
  6. How much storage can $100 buy you? 1980: 1 photo. 1990: 5 songs. 2000: 7 movies. 2010: 600 movies, 170,000 songs, 5 million photos
  7. From single drives…
  8. From single drives… to clusters…
  9. Data Science
  10. "A mathematician is a device for turning coffee into theorems" (Alfréd Rényi)
  11. data scientist: "A mathematician is a device for turning coffee into theorems" (Alfréd Rényi)
  12. data scientist: "A mathematician is a device for turning coffee and data into theorems" (Alfréd Rényi)
  13. data scientist: "A mathematician is a device for turning coffee and data into insights" (Alfréd Rényi)
  14. Hadoop = Map/Reduce + Distributed File System
  15. Hadoop = Map/Reduce + Distributed File System (Storage)
  16. Hadoop = Map/Reduce (Program Model) + Distributed File System (Storage)
  17. Word Count. Raw: "Hello cruel world" / "Say hello! Hello!"
  18. Word Count. Raw → Map: (hello, 1) (cruel, 1) (world, 1) (say, 1) (hello, 2)
  19. Word Count. Map → Reduce: hello [1, 2], cruel [1], world [1], say [1]
  20. Word Count. Reduce → Result: hello 3, cruel 1, world 1, say 1
  21. 4 Main Characteristics of Scala
  22. 4 Main Characteristics of Scala: JVM
  23. 4 Main Characteristics of Scala: JVM, Statically Typed
  24. 4 Main Characteristics of Scala: JVM, Statically Typed, Object Oriented
  25. 4 Main Characteristics of Scala: JVM, Statically Typed, Object Oriented, Functional Programming
  26. def map[B](f: (A) ⇒ B): List[B]
      Builds a new collection by applying a function to all elements of this list.

      def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
      Reduces the elements of this list using the specified associative binary operator.
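      To make the connection concrete, here is a minimal sketch (not on the slides) of the same word count from slides 17-20 written with these collection combinators:

      // Word count with plain Scala collections, mirroring the
      // map / shuffle / reduce phases of the walkthrough above.
      val raw = List("Hello cruel world", "Say hello! Hello!")

      val counts: Map[String, Int] =
        raw
          .flatMap(_.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")) // map: emit words
          .groupBy(identity)                                                   // shuffle: group equal words
          .map { case (word, occs) => (word, occs.map(_ => 1).reduce(_ + _)) } // reduce: sum the 1s

      // counts == Map("hello" -> 3, "cruel" -> 1, "world" -> 1, "say" -> 1)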
  27. Recap
  28. Recap. Map/Reduce: programming paradigm that employs concepts from Functional Programming
  29. Recap. Map/Reduce: programming paradigm that employs concepts from Functional Programming. Scala: functional language that runs on the JVM
  30. Recap. Map/Reduce: programming paradigm that employs concepts from Functional Programming. Scala: functional language that runs on the JVM. Hadoop: open-source implementation of MR on the JVM
  31. So in what language is Hadoop implemented?
  32. The Result?
  33. The Result?

      package org.myorg;

      import java.io.IOException;
      import java.util.*;

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapreduce.*;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

      public class WordCount {

        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);
            }
          }
        }

        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            context.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "wordcount");

          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          job.setMapperClass(Map.class);
          job.setReducerClass(Reduce.class);

          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          job.waitForCompletion(true);
        }
      }
  34. High-level approaches: SQL, Data Transformations
  35. High-level approaches

      -- word count in Pig
      input_lines = LOAD 'myfile.txt' AS (line:chararray);
      words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
      filtered_words = FILTER words BY word MATCHES '\w+';
      word_groups = GROUP filtered_words BY word;
      word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
      ordered_word_count = ORDER word_count BY count DESC;
      STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
  36. User defined functions (UDF)

      Pig:
      -- myscript.pig
      REGISTER myudfs.jar;
      A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
      B = FOREACH A GENERATE myudfs.UPPER(name);
      DUMP B;

      Java:
      package myudfs;

      import java.io.IOException;
      import org.apache.pig.EvalFunc;
      import org.apache.pig.data.Tuple;
      import org.apache.pig.impl.util.WrappedIOException;

      public class UPPER extends EvalFunc<String> {
        public String exec(Tuple input) throws IOException {
          if (input == null || input.size() == 0)
            return null;
          try {
            String str = (String) input.get(0);
            return str.toUpperCase();
          } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
          }
        }
      }
  37. WordCount in Cascading

      package impatient;

      import java.util.Properties;

      import cascading.flow.Flow;
      import cascading.flow.FlowDef;
      import cascading.flow.hadoop.HadoopFlowConnector;
      import cascading.operation.aggregator.Count;
      import cascading.operation.regex.RegexSplitGenerator;
      import cascading.pipe.Each;
      import cascading.pipe.Every;
      import cascading.pipe.GroupBy;
      import cascading.pipe.Pipe;
      import cascading.property.AppProps;
      import cascading.scheme.hadoop.TextDelimited;
      import cascading.tap.Tap;
      import cascading.tap.hadoop.Hfs;
      import cascading.tuple.Fields;

      public class Main {
        public static void main( String[] args ) {
          String docPath = args[ 0 ];
          String wcPath = args[ 1 ];

          Properties properties = new Properties();
          AppProps.setApplicationJarClass( properties, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

          // create source and sink taps
          Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
          Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

          // specify a regex operation to split the "document" text lines into a token stream
          Fields token = new Fields( "token" );
          Fields text = new Fields( "text" );
          RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
          // only returns "token"
          Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

          // determine the word counts
          Pipe wcPipe = new Pipe( "wc", docPipe );
          wcPipe = new GroupBy( wcPipe, token );
          wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef()
              .setName( "wc" )
              .addSource( docPipe, docTap )
              .addTailSink( wcPipe, wcTap );

          // write a DOT file and run the flow
          Flow wcFlow = flowConnector.connect( flowDef );
          wcFlow.writeDOT( "dot/wc.dot" );
          wcFlow.complete();
        }
      }
  38. Good parts: Data Flow Programming Model; User Defined Functions
  39. Good parts: Data Flow Programming Model; User Defined Functions. Bad parts: still Java; objects for flows
  40. package com.twitter.scalding.examples

      import com.twitter.scalding._

      class WordCountJob(args : Args) extends Job(args) {
        TextLine( args("input") )
          .flatMap('line -> 'word) { line : String => tokenize(line) }
          .groupBy('word) { _.size }
          .write( Tsv( args("output") ) )

        // Split a piece of text into individual words.
        def tokenize(text : String) : Array[String] = {
          // Lowercase each word and remove punctuation.
          text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
        }
      }
  41. TDD Cycle: Red → Green → Refactor
  42. Broader view: the Red/Green/Refactor loop nested inside Unit Testing, Acceptance Testing, Continuous Deployment, and Lean Startup cycles
  43. Big Data, Big Speed
  44-51. A typical day working with Hadoop (image sequence)
  52. Is Scalding of any help here?
  53. Is Scalding of any help here? 0. Size of code
  54. Is Scalding of any help here? 0. Size of code; 1. Types
  55. Is Scalding of any help here? 0. Size of code; 1. Types; 2. Unit Testing
  56. Is Scalding of any help here? 0. Size of code; 1. Types; 2. Unit Testing; 3. Local execution
  57. 1. Types
  58. An extra cycle: Unit Testing, Acceptance Testing, Continuous Deployment, Lean Startup
  59. An extra cycle: Compilation Phase, Unit Testing, Acceptance Testing, Continuous Deployment, Lean Startup
  60. Static typechecking makes you a better programmer™
  61. Fail-fast with type errors: (Int,Int,Int,Int)
  62. Fail-fast with type errors: (Int,Int,Int,Int) vs TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]
  63. Fail-fast with type errors:

      val w = 5
      val x = 5
      val y = 5
      val z = 5

      w + x + y + z   // = 20

  64. Fail-fast with type errors:

      val w = Meters(5)
      val x = Miles(5)
      val y = Celsius(5)
      val z = Fahrenheit(5)

      w + x + y + z   // => type error
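      The wrapper types are not defined on the slides; a minimal sketch of how they could look, using hypothetical value classes (Meters, Miles, etc. are illustrative, not Scalding API):

      // Zero-overhead wrappers: the compiler rejects mixed-unit arithmetic
      // at compile time instead of producing a wrong number at runtime.
      case class Meters(value: Int) extends AnyVal {
        def +(other: Meters): Meters = Meters(value + other.value)
      }
      case class Miles(value: Int) extends AnyVal
      case class Celsius(value: Int) extends AnyVal
      case class Fahrenheit(value: Int) extends AnyVal

      val w = Meters(5)
      val x = Miles(5)

      w + Meters(10)   // fine: Meters(15)
      // w + x         // does not compile: Meters#+ expects Meters, not Miles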
  65. 2. Unit Testing
  66. How do you test a distributed algorithm without a distributed platform?
  67-69. Source / Tap (diagram sequence)
  70. // Scalding
      import com.twitter.scalding._

      class WordCountTest extends Specification with TupleConversions {
        "A WordCount job" should {
          JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
            arg("input", "inputFile").
            arg("output", "outputFile").
            source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
            sink[(String,Int)](Tsv("outputFile")){ outputBuffer =>
              val outMap = outputBuffer.toMap
              "count words correctly" in {
                outMap("hack") must be_==(4)
                outMap("and") must be_==(1)
              }
            }.
            run.
            finish
        }
      }
  71. 3. Local Execution
  72-73. HDFS / Local (diagram sequence)
  74. SBT as a REPL:

      > run-main com.twitter.scalding.Tool MyJob --local
      > run-main com.twitter.scalding.Tool MyJob --hdfs
  75. More Scalding goodness
  76. More Scalding goodness: Algebird
  77. More Scalding goodness: Algebird, Matrix library
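      As a taste of Algebird, a minimal sketch (not on the slides): its algebraic abstractions (Semigroup, Monoid, …) let Scalding combine partial per-key results from many mappers in any grouping order, because the combine operation is associative.

      import com.twitter.algebird._

      // Tuples of semigroups are themselves semigroups, so several
      // statistics can be aggregated together in a single pass.
      val a = (1, Max(10), Min(3))   // (count, max, min) from one partition
      val b = (4, Max(7),  Min(8))   // the same stats from another partition

      val combined = Semigroup.plus(a, b)
      // combined == (5, Max(10), Min(3))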