Twitter's Scalding is built on top of Cascading, which is in turn built on top of Hadoop. In essence, it is an easy-to-read, easy-to-extend DSL for writing MapReduce jobs.
Why Scalding?
Word Count in Scala

type Word = String
type Count = Int

val text = "a a a b b"

def wordCount(text: String): Map[Word, Count] =
  text
    .split(" ")
    .map(a => (a, 1))
    .groupBy(_._1)
    .map { a => a._1 -> a._2.map(_._2).sum }

wordCount(text) should equal (Map("a" -> 3, "b" -> 2))
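Step by step, the intermediate values for "a a a b b" look like this (a sketch; the shapes follow from the standard library):

text.split(" ")       // Array("a", "a", "a", "b", "b")
  .map(a => (a, 1))   // Array(("a",1), ("a",1), ("a",1), ("b",1), ("b",1))
  .groupBy(_._1)      // Map("a" -> Array(("a",1), ("a",1), ("a",1)),
                      //     "b" -> Array(("b",1), ("b",1)))
  .map { a => a._1 -> a._2.map(_._2).sum } // Map("a" -> 3, "b" -> 2)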
Stuff > Memory
Scala collections... fun, but memory bound!

val text = "so many words... waaah! ..."
text                                    // in memory
  .split(" ")                           // in memory
  .map(a => (a, 1))                     // in memory
  .groupBy(_._1)                        // in memory
  .map(a => (a._1, a._2.map(_._2).sum)) // in memory
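A streaming formulation postpones the problem a little: folding over an Iterator keeps only the running counts in memory, not the whole corpus. A minimal sketch (the file name is a placeholder):

import scala.io.Source

val counts: Map[String, Int] =
  Source.fromFile("huge.txt") // hypothetical input
    .getLines()
    .flatMap(_.split(" "))
    .foldLeft(Map.empty[String, Int]) { (acc, word) =>
      acc.updated(word, acc.getOrElse(word, 0) + 1)
    }

This is still bounded by the number of distinct words, though, which is where Hadoop comes in.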
Why Scalding?
Word Count in Hadoop MR

package org.myorg;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
1: Distributed Copy

// source Tap
Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);
// sink Tap
Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);
// a Pipe, connects taps
Pipe copyPipe = new Pipe("copy");
// build the Flow
FlowDef flowDef = FlowDef.flowDef()
  .addSource(copyPipe, inTap)
  .addTailSink(copyPipe, outTap);
// run!
flowConnector.connect(flowDef).complete();
1. DCP - Full Code

public class Main {
  public static void main(String[] args) {
    String inPath = args[0];
    String outPath = args[1];
    Properties props = new Properties();
    AppProps.setApplicationJarClass(props, Main.class);
    HadoopFlowConnector flowConnector = new HadoopFlowConnector(props);
    Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);
    Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);
    Pipe copyPipe = new Pipe("copy");
    FlowDef flowDef = FlowDef.flowDef()
      .addSource(copyPipe, inTap)
      .addTailSink(copyPipe, outTap);
    flowConnector.connect(flowDef).complete();
  }
}
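Packaged into a job jar, this runs like any other Hadoop application, along the lines of hadoop jar target/app.jar <inPath> <outPath> (the jar name and paths here are placeholders).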
2: Word Count

public class Main {
  public static void main( String[] args ) {
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );
    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
  }
}
2: Word Count - How it's made

Graph representation of jobs!
http://www.cascading.org/2012/07/09/cascading-for-the-impatient-part-2/
How it's made

// pseudo code...
val flow = FlowDef
val jobs: List[MRJob] = flowConnector(flow)
HadoopCluster.execute(jobs)
Cascading tips

Pipe assembly = new Pipe( "assembly" );
assembly = new Each( assembly, DebugLevel.VERBOSE, new Debug() );
// ...
// head and tail have same name
FlowDef flowDef = new FlowDef()
  .setName( "debug" )
  .addSource( "assembly", source )
  .addSink( "assembly", sink )
  .addTail( assembly );

flowDef.setDebugLevel( DebugLevel.NONE );
With DebugLevel.NONE set, the flowConnector will NOT create the Debug pipe!
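For comparison, Scalding's fields API wraps the same Debug operation. A minimal sketch (inside a Job, with hypothetical taps):

Tsv(args("input"))
  .read
  .debug // RichPipe.debug inserts Cascading's Debug(), printing each tuple as it flows past
  .write(Tsv(args("output")))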
map

Scala:
val data = 1 :: 2 :: 3 :: Nil
val doubled = data map { _ * 2 } // Int => Int

Scalding:
IterableSource(data, 'number)
  .map('number -> 'doubled) { n: Int => n * 2 } // Int => Int
// 'number stays in the Pipe, 'doubled becomes available in the Pipe
// note: the argument's type must be declared explicitly!
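The resulting tuple stream, as a sketch; both fields travel together in the Pipe:

// fields: ('number, 'doubled)
// (1, 2), (2, 4), (3, 6)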
mapTo

Scala:
var data = 1 :: 2 :: 3 :: Nil
val doubled = data map { _ * 2 } // Int => Int
data = null // release reference

Scalding:
IterableSource(data, 'number)
  .mapTo('number -> 'doubled) { n: Int => n * 2 } // Int => Int
// 'number is removed; only 'doubled stays in the Pipe
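mapTo is equivalent to a map followed by a project; the same thing written long-hand:

IterableSource(data, 'number)
  .map('number -> 'doubled) { n: Int => n * 2 }
  .project('doubled) // drops 'number, keeping only 'doubled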
groupBy

Scala:
val data = 1 :: 2 :: 30 :: 42 :: Nil // List[Int]
val groups = data groupBy { _ < 10 }
groups // Map[Boolean, List[Int]]
groups(true) should equal (List(1, 2))
groups(false) should equal (List(30, 42))

Scalding:
IterableSource(List(1, 2, 30, 42), 'num)
  .map('num -> 'lessThanTen) { i: Int => i < 10 }
  .groupBy('lessThanTen) { _.size('size) } // groups all rows with == value => 'size
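The grouped output contains one tuple per distinct key (a sketch):

// fields: ('lessThanTen, 'size)
// (true,  2)   <- from 1, 2
// (false, 2)   <- from 30, 42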
Distributed Copy in Scalding

class WordCountJob(args: Args) extends Job(args) {
  val input = Tsv(args("input"))
  val output = Tsv(args("output"))

  input.read.write(output)
}

The End.
Main Class - "Runner"

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import com.twitter.scalding

object ScaldingJobRunner extends App {
  ToolRunner.run(new Configuration, new scalding.Tool, args) // args comes from App
}
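With this runner (or the stock com.twitter.scalding.Tool main class) on the classpath, a job is typically launched as hadoop jar my-assembly.jar com.example.WordCountJob --hdfs --input ... --output ..., where the first argument is the Job class name, --hdfs or --local selects the mode, and --input/--output become available via args("input") and args("output"). (The jar and class names here are placeholders.)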
Word Count in Scalding

class WordCountJob(args: Args) extends Job(args) {
  val inputFile = args("input")
  val outputFile = args("output")

  TextLine(inputFile)
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(outputFile))

  // the slide leaves this as "implemented"; one possible implementation:
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}
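Jobs like this can be unit-tested without a cluster via Scalding's JobTest; a sketch (file names are placeholders, and the sink's tuple type is an assumption about this job's output fields):

import com.twitter.scalding._

JobTest(new WordCountJob(_))
  .arg("input", "in.txt")
  .arg("output", "out.tsv")
  .source(TextLine("in.txt"), List((0, "a a a b b")))
  .sink[(String, Int)](Tsv("out.tsv")) { buffer =>
    // expect the counts computed above
    assert(buffer.toMap == Map("a" -> 3, "b" -> 2))
  }
  .run
  .finish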