Scalding - Hadoop Word Count in LESS than 70 lines of code

Twitter Scalding is built on top of Cascading, which in turn is built on top of Hadoop. It is, in essence, a DSL for writing MapReduce jobs that is pleasant to read and easy to extend.

Usage Rights

CC Attribution-NonCommercial-ShareAlike License

    Scalding - Hadoop Word Count in LESS than 70 lines of code - Presentation Transcript

    • Scalding: Hadoop Word Count in < 70 lines of code. Konrad "ktoso" Malawski, JARCamp #3, 12.04.2013
    • Scalding: Hadoop Word Count in 4 lines of code. Konrad "ktoso" Malawski, JARCamp #3, 12.04.2013
    • softwaremill.com / java.pl / sckrk.com / geecon.org / krakowscala.pl / gdgkrakow.pl
    • Agenda: Why Scalding? (10%) + Hadoop Basics (20%) + Enter Cascading (40%) + Hello Scalding (30%) = 100%
    • Why Scalding? Word Count in Types
      type Word = String
      type Count = Int
      String => Map[Word, Count]
    • Why Scalding? Word Count in Scala
      val text = "a a a b b"
      def wordCount(text: String): Map[Word, Count] =
        text
          .split(" ")
          .map(a => (a, 1))
          .groupBy(_._1)
          .map { a => a._1 -> a._2.map(_._2).sum }
      wordCount(text) should equal (Map("a" -> 3, "b" -> 2))
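    Putting the two slides above together, a self-contained version of the same word count that can be pasted into a Scala REPL (a plain assert stands in for the deck's ScalaTest matcher):

      type Word = String
      type Count = Int

      def wordCount(text: String): Map[Word, Count] =
        text
          .split(" ")                                      // Array[String]
          .map(w => (w, 1))                                // Array[(Word, Count)]
          .groupBy(_._1)                                   // Map[Word, Array[(Word, Count)]]
          .map { case (w, pairs) => w -> pairs.map(_._2).sum }

      assert(wordCount("a a a b b") == Map("a" -> 3, "b" -> 2))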
    • Stuff > Memory
      Scala collections... fun, but memory bound!
      val text = "so many words... waaah! ..."             // in memory
      text                                                 // in memory
        .split(" ")
        .map(a => (a, 1))                                  // in memory
        .groupBy(_._1)
        .map(a => (a._1, a._2.map(_._2).sum))              // in memory
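    Every step above materializes a full collection on the heap. Even if the input were streamed lazily from disk, the resulting counts map itself still has to fit in memory, which is the limit the following slides address. A sketch of that best case (the file path and helper name are illustrative, not from the deck):

      import scala.io.Source

      def wordCountFromFile(path: String): Map[String, Int] =
        Source.fromFile(path)
          .getLines()                                      // lazy: lines are streamed
          .flatMap(_.split(" "))                           // still lazy: words are streamed
          .foldLeft(Map.empty[String, Int]) { (counts, word) =>
            counts.updated(word, counts.getOrElse(word, 0) + 1)  // but the counts map lives entirely on the heap
          }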
    • Apache Hadoop (HDFS + MR) http://hadoop.apache.org/
    • Why Scalding? Word Count in Hadoop MR
      package org.myorg;

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.*;
      import java.io.IOException;
      import java.util.Iterator;
      import java.util.StringTokenizer;

      public class WordCount {

        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              output.collect(word, one);
            }
          }
        }

        public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
              sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          conf.setMapperClass(Map.class);
          conf.setCombinerClass(Reduce.class);
          conf.setReducerClass(Reduce.class);
          conf.setInputFormat(TextInputFormat.class);
          conf.setOutputFormat(TextOutputFormat.class);
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          JobClient.runJob(conf);
        }
      }
    • Trivia: How old is Hadoop?
    • Cascading: www.cascading.org/
    • Cascading is Taps & Pipes & Sinks
    • 1: Distributed Copy
      // source Tap
      Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);
      // sink Tap
      Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);
      // a Pipe connects taps
      Pipe copyPipe = new Pipe("copy");
      // build the Flow
      FlowDef flowDef = FlowDef.flowDef()
        .addSource(copyPipe, inTap)
        .addTailSink(copyPipe, outTap);
      // run!
      flowConnector.connect(flowDef).complete();
    • 1. DCP - Full Code
      public class Main {
        public static void main(String[] args) {
          String inPath = args[0];
          String outPath = args[1];

          Properties props = new Properties();
          AppProps.setApplicationJarClass(props, Main.class);
          HadoopFlowConnector flowConnector = new HadoopFlowConnector(props);

          Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);
          Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

          Pipe copyPipe = new Pipe("copy");

          FlowDef flowDef = FlowDef.flowDef()
            .addSource(copyPipe, inTap)
            .addTailSink(copyPipe, outTap);

          flowConnector.connect(flowDef).complete();
        }
      }
    • 2: Word Count
      String docPath = args[0];
      String wcPath = args[1];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass(properties, Main.class);
      HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

      // create source and sink taps
      Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
      Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

      // specify a regex operation to split the "document" text lines into a token stream
      Fields token = new Fields("token");
      Fields text = new Fields("text");
      RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\](),.]");
      // only returns "token"
      Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

      // determine the word counts
      Pipe wcPipe = new Pipe("wc", docPipe);
      wcPipe = new GroupBy(wcPipe, token);
      wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef()
        .setName("wc")
        .addSource(docPipe, docTap)
        .addTailSink(wcPipe, wcTap);

      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect(flowDef);
      wcFlow.writeDOT("dot/wc.dot");
      wcFlow.complete();
    • 2: Word Count - How it's made: a graph representation of jobs!
    • 2: Word Count - How it's made: http://www.cascading.org/2012/07/09/cascading-for-the-impatient-part-2/
    • How it's made
      val flow = FlowDef
      // pseudo code...
      val jobs: List[MRJob] = flowConnector(flow)
      // pseudo code...
      HadoopCluster.execute(jobs)
    • Cascading tips
      Pipe assembly = new Pipe("assembly");
      assembly = new Each(assembly, DebugLevel.VERBOSE, new Debug());
      // ...
      // head and tail have same name
      FlowDef flowDef = new FlowDef()
        .setName("debug")
        .addSource("assembly", source)
        .addSink("assembly", sink)
        .addTail(assembly);

      flowDef.setDebugLevel(DebugLevel.NONE);
      // with DebugLevel.NONE the flowConnector will NOT create the Debug pipe!
    • Scalding = Scala + Cascading. Twitter Scalding: github.com/twitter/scalding
    • Scalding API
    • map
      Scala:
        val data = 1 :: 2 :: 3 :: Nil
        val doubled = data map { _ * 2 }                  // Int => Int
      Scalding:
        IterableSource(data)
          .map('number -> 'doubled) { n: Int => n * 2 }   // Int => Int
        // 'number stays in the Pipe, 'doubled becomes available in the Pipe
        // note: the argument type (n: Int) must be chosen explicitly!
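    For context, a minimal sketch of that map wrapped in a complete, runnable Job (the job name, the explicit field on the source, and the Tsv output argument are illustrative assumptions, not from the deck):

      import com.twitter.scalding._

      class DoubleNumbersJob(args: Args) extends Job(args) {
        // 'number is the field read from the source; map adds 'doubled next to it
        IterableSource(List(1, 2, 3), 'number)
          .map('number -> 'doubled) { n: Int => n * 2 }
          .write(Tsv(args("output")))   // each row carries both 'number and 'doubled
      }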
    • mapTo
      Scala:
        var data = 1 :: 2 :: 3 :: Nil
        val doubled = data map { _ * 2 }                  // Int => Int
        data = null                                       // release the reference
      Scalding:
        IterableSource(data)
          .mapTo('number -> 'doubled) { n: Int => n * 2 } // Int => Int
        // 'doubled stays in the Pipe, 'number is removed
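    The same sketch with mapTo shows the difference in what the pipe carries afterwards (again, names and the sink are illustrative assumptions):

      import com.twitter.scalding._

      class DoubleNumbersOnlyJob(args: Args) extends Job(args) {
        IterableSource(List(1, 2, 3), 'number)
          .mapTo('number -> 'doubled) { n: Int => n * 2 }  // 'number is dropped here
          .write(Tsv(args("output")))                      // each row carries only 'doubled: 2, 4, 6
      }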
    • flatMap
      Scala:
        val data = "1" :: "2,2" :: "3,3,3" :: Nil         // List[String]
        val numbers = data flatMap { line =>              // String
          line.split(",")                                 // Array[String]
        } map { _.toInt }                                 // List[Int]
        numbers should equal (List(1, 2, 2, 3, 3, 3))
      Scalding:
        TextLine(data)                                    // like List[String]
          .flatMap('line -> 'word) { _.split(",") }       // like List[String]
          .map('word -> 'number) { _.toInt }              // like List[Int]
        // the toInt map runs outside the flatMap, as a separate step in the MR pipeline
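    A self-contained sketch of that fields-based flatMap/map chain (the input argument, job name, and the explicit parameter types are assumptions added so the closures compile):

      import com.twitter.scalding._

      class SplitNumbersJob(args: Args) extends Job(args) {
        TextLine(args("input"))                                          // one 'line per input line
          .flatMap('line -> 'word) { line: String => line.split(",") }   // one 'word per comma-separated token
          .map('word -> 'number) { word: String => word.toInt }          // parsing as a separate pipeline step
          .write(Tsv(args("output")))
      }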
    • flatMap
      Scala:
        val data = "1" :: "2,2" :: "3,3,3" :: Nil         // List[String]
        val numbers = data flatMap { line =>              // String
          line.split(",").map(_.toInt)                    // Array[Int]
        }
        numbers should equal (List(1, 2, 2, 3, 3, 3))
      Scalding:
        TextLine(data)                                    // like List[String]
          .flatMap('line -> 'word) { _.split(",").map(_.toInt) }  // like List[Int]
        // here the map happens inside the flatMap, in plain Scala
    • groupBy
      Scala:
        val data = 1 :: 2 :: 30 :: 42 :: Nil              // List[Int]
        val groups = data groupBy { _ < 10 }
        groups                                            // Map[Boolean, List[Int]]
        groups(true) should equal (List(1, 2))
        groups(false) should equal (List(30, 42))
      Scalding:
        IterableSource(List(1, 2, 30, 42), 'num)
          .map('num -> 'lessThanTen) { i: Int => i < 10 }
          .groupBy('lessThanTen) { _.size('size) }
        // groups everything with the same 'lessThanTen value, then takes each group's size
    • groupBy
      Scalding:
        IterableSource(List(1, 2, 30, 42), 'num)
          .map('num -> 'lessThanTen) { i: Int => i < 10 }
          .groupBy('lessThanTen) { _.sum('num -> 'total) }
        // 'total = [3, 72]
    • Scalding API: project / discard, map / mapTo, flatMap / flatMapTo, rename, filter, unique, groupBy / groupAll / groupRandom / shuffle, limit, debug, group operations, joins (a combined example follows below)
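    A sketch chaining several of the listed operations in one pipeline (the field names, input source, and limit value are illustrative assumptions; exact signatures can vary slightly between Scalding versions):

      import com.twitter.scalding._

      class ApiTourJob(args: Args) extends Job(args) {
        TextLine(args("input"))
          .flatMap('line -> 'word) { line: String => line.split("\\s+") }
          .filter('word) { word: String => word.nonEmpty }  // drop empty tokens
          .unique('word)                                    // keep each distinct word once
          .rename('word -> 'token)                          // rename the field
          .debug                                            // print tuples while the job runs
          .limit(100)                                       // keep at most 100 tuples
          .write(Tsv(args("output")))
      }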
    • Distributed Copy in Scalding
      class WordCountJob(args: Args) extends Job(args) {
        val input  = Tsv(args("input"))
        val output = Tsv(args("output"))

        input.read.write(output)
      }
      The End.
    • Main Class - "Runner"
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.util.ToolRunner
      import com.twitter.scalding

      object ScaldingJobRunner extends App {
        // args is provided by the App trait
        ToolRunner.run(new Configuration, new scalding.Tool, args)
      }
    • Word Count in Scalding
      class WordCountJob(args: Args) extends Job(args) {
        val inputFile = args("input")
        val outputFile = args("output")

        // these 4 lines are the whole word count job
        TextLine(inputFile)
          .flatMap('line -> 'word) { line: String => tokenize(line) }
          .groupBy('word) { _.size }
          .write(Tsv(outputFile))

        def tokenize(text: String): Array[String] = /* implemented */
      }
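    The deck leaves tokenize as "implemented". One plausible implementation (an assumption, not the author's code) that lower-cases the text and splits on non-word characters:

      def tokenize(text: String): Array[String] =
        text.toLowerCase.split("\\W+").filter(_.nonEmpty)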
    • Word Count in Scalding
      run pl.project13.scala.oculus.job.WordCountJob --tool.graph
      => pl.project13.scala.oculus.job.WordCountJob0.dot
      (the generated .dot graph shows which steps run in the MAP phase and which need the full MAP + REDUCE phases)
    • Word Count in Scalding
      TextLine(inputFile)
        .flatMap('line -> 'word) { line: String => tokenize(line) }
        .groupBy('word) { _.size('count) }
        .write(Tsv(outputFile))
    • Why Scalding? Hadoop inside, Cascading abstractions, Scala conciseness.
    • Ask Stuff! Dzięki! Thanks! ありがとう! Konrad Malawski @ java.pl, t: ktosopl / g: ktoso / b: blog.project13.pl