Introduction to Cascading
What is Cascading?

• abstraction over MapReduce
• API for data-processing workflows
Why use Cascading?

• MapReduce can be:
    • the wrong level of granularity
    • cumbersome to chain together

Why use Cascading?

• Cascading helps create
    • higher-level data processing abstractions
    • sophisticated data pipelines
    • reusable components
Credits

• Cascading written by Chris Wensel
• based on his user's guide

        http://bit.ly/cascading
package cascadingtutorial.wordcount;

// imports (the slides show only a few; the full list is reconstructed here)
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;
import java.util.Properties;

/**
 * Wordcount example in Cascading
 */
public class Main
  {
  public static void main( String[] args )
    {
    String inputPath  = args[0];
    String outputPath = args[1];

    Scheme inputScheme  = new TextLine(new Fields("offset", "line"));
    Scheme outputScheme = new TextLine();

    Tap sourceTap = inputPath.matches("^[^:]+://.*") ?
      new Hfs(inputScheme, inputPath) :
      new Lfs(inputScheme, inputPath);
    Tap sinkTap   = outputPath.matches("^[^:]+://.*") ?
      new Hfs(outputScheme, outputPath) :
      new Lfs(outputScheme, outputPath);

    // split each line on whitespace, emitting one "word" Tuple per token
    Pipe wcPipe =
      new Each("wordcount",
        new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"),
        new Fields("word"));

    wcPipe = new GroupBy(wcPipe, new Fields("word"));
    wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, Main.class);

    Flow parsedLogFlow = new FlowConnector(properties)
      .connect(sourceTap, sinkTap, wcPipe);
    parsedLogFlow.start();
    parsedLogFlow.complete();
    }
  }
Data Model

(diagrams: the pipes-and-filters model — data moves from file to file through Pipes, with Filters applied along the way)
Tuple

(diagram: a single Tuple — an ordered list of values)

Pipe Assembly

(diagram: Pipes chained into an assembly; data moves through it as a Tuple Stream)
Taps: sources & sinks

(diagram: Taps connect an assembly to data — a Source Tap provides input, a Sink Tap receives output)
Pipe + Taps = Flow

(diagram: binding Source and Sink Taps to a Pipe Assembly yields a Flow)

n Flows = Cascade

(diagram: many Flows wired together form a Cascade, enabling complex topologies)
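Not from the deck, but a minimal sketch of wiring Flows into a Cascade with Cascading's CascadeConnector (the Flows importFlow and reportFlow are assumed to exist):

// sketch: run two dependent Flows as one unit
Cascade cascade = new CascadeConnector().connect(importFlow, reportFlow);
cascade.complete(); // resolves dependencies between the Flows and runs them in order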
Example
The word count program again, piece by piece.

Schemes describe how Tuples are read and written, e.g. TextLine() and SequenceFile():

Scheme inputScheme  = new TextLine(new Fields("offset", "line"));
Scheme outputScheme = new TextLine();

Taps bind the assembly to concrete data: Hfs() for HDFS paths like
hdfs://master0:54310/user/nmurray/data.txt, Lfs() for local paths like
data/sources/obama-inaugural-address.txt; S3fs() and GlobHfs() also exist:

String inputPath  = args[0];
String outputPath = args[1];

Tap sourceTap = inputPath.matches("^[^:]+://.*") ?
  new Hfs(inputScheme, inputPath) :
  new Lfs(inputScheme, inputPath);
Tap sinkTap   = outputPath.matches("^[^:]+://.*") ?
  new Hfs(outputScheme, outputPath) :
  new Lfs(outputScheme, outputPath);

The Pipe Assembly does the actual work:

Pipe wcPipe =
  new Each("wordcount",
    new Fields("line"),
    new RegexSplitGenerator(new Fields("word"), "\\s+"),
    new Fields("word"));

wcPipe = new GroupBy(wcPipe, new Fields("word"));
wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
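As a variation (a sketch, not in the deck), the sink Scheme could be swapped for a binary SequenceFile():

// sketch: write the (count, word) results as a Hadoop SequenceFile instead of text
Scheme binaryScheme = new SequenceFile(new Fields("count", "word"));
Tap binarySink = new Hfs(binaryScheme, outputPath);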
Pipe Assemblies

• Define work against a Tuple Stream
• May have multiple sources and sinks
    • Splits
    • Merges
    • Joins
Pipe Assemblies

• Pipe
    • Each          applies a Function or Filter Operation to each Tuple
    • GroupBy       groups (& merges)
    • CoGroup       joins (inner, outer, left, right)
    • Every         applies an Aggregator (count, sum) to every group of Tuples
    • SubAssembly   nesting
Each vs. Every

• Each() is for individual Tuples
• Every() is for groups of Tuples
Each

new Each(previousPipe,
         argumentSelector,
         operation,
         outputSelector)

Pipe wcPipe =
  new Each("wordcount",
    new Fields("line"),
    new RegexSplitGenerator(new Fields("word"), "\\s+"),
    new Fields("word"));

The selectors are Fields() instances; the operation here is a RegexSplitGenerator.

Given an input tuple

    offset | line
    0      | the lazy brown fox

the argument selector drops "offset" and passes "line" to the operation, which
emits one Tuple per token:

    word
    the
    lazy
    brown
    fox
Operations: Functions

•   Insert()
•   RegexParser() / RegexGenerator() / RegexReplace()
•   DateParser() / DateFormatter()
•   XPathGenerator()
•   Identity()
•   etc...
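For instance (a sketch, not from the deck; the "line" and "ip" fields are assumptions), RegexParser() can pull a leading IP address out of a log line:

// sketch: parse the first whitespace-delimited token of "line" into an "ip" field
pipe = new Each(pipe,
           new Fields("line"),
           new RegexParser(new Fields("ip"), "^([^ ]*)"),
           Fields.RESULTS);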
Operations: Filters

•   RegexFilter()
•   FilterNull()
•   And() / Or() / Not() / Xor()
•   ExpressionFilter()
•   etc...
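For example (a sketch; the page_number field is an assumption), ExpressionFilter() removes Tuples for which a Java expression evaluates true:

// sketch: drop every tuple past the first results page
pipe = new Each(pipe,
           new Fields("page_number"),
           new ExpressionFilter("page_number > 1", int.class));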
Back in the word count, the Each feeds a GroupBy, and the grouped stream feeds an Every:

wcPipe = new GroupBy(wcPipe, new Fields("word"));
wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
Every

new Every(previousPipe,
         argumentSelector,
         operation,
         outputSelector)

new Every(wcPipe, new Count(), new Fields("count", "word"));
Operations: Aggregator

•   Average()
•   Count()
•   First() / Last()
•   Min() / Max()
•   Sum()
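A quick sketch (field names assumed, not from the deck) of another Aggregator in the same Every position:

// sketch: total page_views per visitor_id group
pipe = new GroupBy(pipe, new Fields("visitor_id"));
pipe = new Every(pipe,
           new Fields("page_views"),
           new Sum(new Fields("total_views")),
           Fields.ALL);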
Every

new Every(wcPipe, new Count(), new Fields("count", "word"));

// the same call with the argument selector written out
new Every(wcPipe, Fields.ALL, new Count(), new Fields("count", "word"));
Field Selection

Predefined Field Sets

Fields.ALL       all available fields
Fields.GROUP     fields used for last grouping
Fields.VALUES    fields not used for last grouping
Fields.ARGS      fields of the argument Tuple (for Operations)
Fields.RESULTS   replaces input with Operation result (for Pipes)
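To make the last one concrete (a sketch in the deck's own DateParser style): with Fields.RESULTS as the output selector, only the Operation's result fields survive:

// sketch: keep only DateParser's result field, discarding all input fields
pipe = new Each(pipe,
           new Fields("timestamp"),
           new DateParser("yyyy-MM-dd HH:mm:ss"),
           Fields.RESULTS);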
Field Selection

new Each(previousPipe,
         argumentSelector,
         operation,
         outputSelector)

Operation input Tuple = Original Tuple filtered by the Argument Selector

pipe = new Each(pipe,
           new Fields("timestamp"),
           new DateParser("yyyy-MM-dd HH:mm:ss"),
           new Fields("ts", "search_term", "visitor_id"));

Original Tuple:

    id | visitor_id | search_term | timestamp | page_number

Argument Selector: timestamp

Input to DateParser will be: timestamp
Field Selection

new Each(previousPipe,
         argumentSelector,
         operation,
         outputSelector)

Output Tuple = (Original Tuple ⊕ Operation Tuple) filtered by the Output Selector

pipe = new Each(pipe,
           new Fields("timestamp"),
           new DateParser("yyyy-MM-dd HH:mm:ss"),
           new Fields("ts", "search_term", "visitor_id"));

Original Tuple:                                        DateParser output:

    id | visitor_id | search_term | timestamp | page_number   ⊕   ts

Output Selector: ts, search_term, visitor_id

Output of Each will be:

    ts | search_term | visitor_id

(the original id, timestamp, and page_number fields are discarded)
Given a grouped word stream

    word
    hello
    hello
    world

new Every(wcPipe, new Count(), new Fields("count", "word"));

produces

    count | word
    2     | hello
    1     | world
Finally, the Flow is wired up and run:

Properties properties = new Properties();
FlowConnector.setApplicationJarClass(properties, Main.class);

Flow parsedLogFlow = new FlowConnector(properties)
  .connect(sourceTap, sinkTap, wcPipe);
parsedLogFlow.start();
parsedLogFlow.complete();
Run

input:
   mary had a little lamb
   little lamb
   little lamb
   mary had a little lamb
   whose fleece was white as snow

command:
   hadoop jar ./target/cascading-tutorial-1.0.0.jar data/sources/misc/mary.txt data/output

output:
 2    a
 1    as
 1    fleece
 2    had
 4    lamb
 4    little
 2    mary
 1    snow
 1    was
 1    white
 1    whose
What’s next?

apply operations to tuple streams

repeat.
Cookbook

// remove all tuples where search_term is '-'
pipe = new Each(pipe,
           new Fields("search_term"),
           new RegexFilter("^(-)$", true));

// convert "timestamp" String field to "ts" long field
pipe = new Each(pipe,
           new Fields("timestamp"),
           new DateParser("yyyy-MM-dd HH:mm:ss"),
           Fields.ALL);

// td - timestamp of day-level granularity
pipe = new Each(pipe,
           new Fields("ts"),
           new ExpressionFunction(
               new Fields("td"),
               "ts - (ts % (24 * 60 * 60 * 1000))",
               long.class),
           Fields.ALL);

// SELECT DISTINCT visitor_id
// GROUP BY visitor_id ORDER BY ts
pipe = new GroupBy(pipe,
           new Fields("visitor_id"),
           new Fields("ts"));
// take the First() tuple of every grouping
pipe = new Every(pipe, Fields.ALL, new First(), Fields.RESULTS);
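One more recipe in the same spirit (not from the deck; field names assumed):

// count events per visitor
pipe = new GroupBy(pipe, new Fields("visitor_id"));
pipe = new Every(pipe, new Count(new Fields("event_count")), Fields.ALL);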
Custom Operations

Desired Operation

Pipe pipe = new Each("ngram pipe",
    new Fields("line"),
    new NGramTokenizer(2),
    new Fields("ngram"));

Given

    line
    mary had a little lamb

it should emit one Tuple per 2-gram:

    ngram
    mary had
    had a
    a little
    little lamb
Custom Operations

• extend BaseOperation
• implement Function (or Filter)
• implement operate
public class NGramTokenizer extends BaseOperation implements Function
{
  private int size;

  public NGramTokenizer(int size)
  {
    super(new Fields("ngram"));
    this.size = size;
  }

  public void operate(FlowProcess flowProcess, FunctionCall functionCall)
  {
    String token;
    TupleEntry arguments = functionCall.getArguments();
    com.attinteractive.util.NGramTokenizer t =
      new com.attinteractive.util.NGramTokenizer(arguments.getString(0), this.size);

    while((token = t.next()) != null)
    {
      TupleEntry result = new TupleEntry(new Fields("ngram"), new Tuple(token));
      functionCall.getOutputCollector().add(result);
    }
  }
}
A Tuple is a zero-indexed, ordered list of values:

Tuple    123   vis-abc   tacos   2009-10-08 03:21:23   1

Fields name those positions:

Fields   id    visitor_id   search_term   timestamp   page_number

A TupleEntry pairs Fields with a Tuple, giving named access to the values:

         id    visitor_id   search_term   timestamp             page_number
         123   vis-abc      tacos         2009-10-08 03:21:23   1
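As a sketch of what named access buys inside an Operation (mirroring the deck's arguments.getString(0) above):

TupleEntry arguments = functionCall.getArguments();
String term  = arguments.getString("search_term"); // by field name
String first = arguments.getString(0);             // or by position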
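A minimal custom Filter counterpart (a sketch, not from the deck; the 3-character rule is arbitrary), following the same BaseOperation structure:

public class ShortLineFilter extends BaseOperation implements Filter
{
  public boolean isRemove(FlowProcess flowProcess, FilterCall filterCall)
  {
    // returning true removes the tuple from the stream
    return filterCall.getArguments().getString(0).length() < 3;
  }
}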
Success

    line
    mary had a little lamb

Pipe pipe = new Each("ngram pipe",
    new Fields("line"),
    new NGramTokenizer(2),
    new Fields("ngram"));

    ngram
    mary had
    had a
    a little
    little lamb
code examples

dot

Flow importFlow = flowConnector.connect(sourceTap, tmpIntermediateTap, importPipe);
importFlow.writeDOT("import-flow.dot"); // filename completed from the truncated slide
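The resulting file renders with Graphviz, e.g.:

dot -Tpng import-flow.dot -o import-flow.png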
What next?

• http://www.cascading.org/
• http://bit.ly/cascading
Questions?
 <nmurray@att.com>
Transcript of "Intro To Cascading"

  1. 1. Introduction to Cascading
  2. 2. What is Cascading?
  3. 3. What is Cascading? • abstraction over MapReduce
  4. 4. What is Cascading? • abstraction over MapReduce • API for data-processing workflows
  5. 5. Why use Cascading?
  6. 6. Why use Cascading? • MapReduce can be:
  7. 7. Why use Cascading? • MapReduce can be: • the wrong level of granularity
  8. 8. Why use Cascading? • MapReduce can be: • the wrong level of granularity • cumbersome to chain together
  9. 9. Why use Cascading?
  10. 10. Why use Cascading? • Cascading helps create
  11. 11. Why use Cascading? • Cascading helps create • higher-level data processing abstractions
  12. 12. Why use Cascading? • Cascading helps create • higher-level data processing abstractions • sophisticated data pipelines
  13. 13. Why use Cascading? • Cascading helps create • higher-level data processing abstractions • sophisticated data pipelines • reusable components
  14. 14. Credits • Cascading written by Chris Wensel • based on his users guide http://bit.ly/cascading
  15. 15. package cascadingtutorial.wordcount; /** * Wordcount example in Cascading */ public class Main { public static void main( String[] args ) { String inputPath = args[0]; String outputPath = args[1]; Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  16. 16. Data Model Pipes Filters
  17. 17. Data Model file
  18. 18. Data Model file
  19. 19. Data Model file
  20. 20. Data Model file
  21. 21. Data Model file
  22. 22. Data Model file
  23. 23. Data Model file file file
  24. 24. Data Model file file file Filters
  25. 25. Tuple
  26. 26. Pipe Assembly
  27. 27. Pipe Assembly Tuple Stream
  28. 28. Taps: sources & sinks
  29. 29. Taps: sources & sinks Taps
  30. 30. Taps: sources & sinks Taps
  31. 31. Taps: sources & sinks Taps
  32. 32. Taps: sources & sinks Taps Source
  33. 33. Taps: sources & sinks Taps Source Sink
  34. 34. Pipe + Taps = Flow Flow
  35. 35. Flow Flow Flow Flow Flow
  36. 36. n Flows Flow Flow Flow Flow Flow
  37. 37. n Flows = Cascade Flow Flow Flow Flow Flow
  38. 38. n Flows = Cascade Flow Flow Flow Flow Flow complex
  39. 39. Example
  40. 40. package cascadingtutorial.wordcount; /** * Wordcount example in Cascading */ public class Main { public static void main( String[] args ) { String inputPath = args[0]; String outputPath = args[1]; Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  41. 41. package cascadingtutorial.wordcount; /** * Wordcount example in Cascading */ public class Main { public static void main( String[] args ) { String inputPath = args[0]; String outputPath = args[1]; Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  42. 42. import cascading.tap.Tap; import cascading.tuple.Fields; import java.util.Properties; /** * Wordcount example in Cascading */ public class Main { public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  43. 43. import cascading.tap.Tap; import cascading.tuple.Fields; import java.util.Properties; /** * Wordcount example in Cascading */ public class Main { public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  44. 44. import cascading.tap.Tap; import cascading.tuple.Fields; import java.util.Properties; /** * Wordcount example in Cascading */ public class Main { public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); TextLine() Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  45. 45. import cascading.tap.Tap; import cascading.tuple.Fields; import java.util.Properties; /** * Wordcount example in Cascading */ public class Main { public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); TextLine() Tap sinkTap SequenceFile() = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  46. 46. import cascading.tap.Tap; import cascading.tuple.Fields; import java.util.Properties; /** * Wordcount example in Cascading */ public class Main { public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  47. 47. /** * Wordcount example in Cascading */ public class Main { public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
  48. 48. /** * Wordcount example in Cascading */ public class Main { hdfs://master0:54310/user/nmurray/data.txt public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
  49. 49. /** * Wordcount example in Cascading */ public class Main { hdfs://master0:54310/user/nmurray/data.txt new Hfs() public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
  50. 50. /** * Wordcount example in Cascading */ public class Main { hdfs://master0:54310/user/nmurray/data.txt new Hfs() data/sources/obama-inaugural-address.txt public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
  51. 51. /** * Wordcount example in Cascading */ public class Main { hdfs://master0:54310/user/nmurray/data.txt new Hfs() data/sources/obama-inaugural-address.txt public static void main( String[] args ) new Lfs() { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
  52. 52. /** * Wordcount example in Cascading */ public class Main { hdfs://master0:54310/user/nmurray/data.txt new Hfs() data/sources/obama-inaugural-address.txt public static void main( String[] args ) new Lfs() { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), S3fs() new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
  53. 53. /** * Wordcount example in Cascading */ public class Main { hdfs://master0:54310/user/nmurray/data.txt new Hfs() data/sources/obama-inaugural-address.txt public static void main( String[] args ) new Lfs() { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), S3fs() new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); GlobHfs() new Fields("count", "word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(),
  54. 54. /** * Wordcount example in Cascading */ public class Main { public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));
  55. 55. Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  56. 56. Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Pipe Assembly Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  57. 57. Pipe Assemblies
  58. 58. Pipe Assemblies
  59. 59. Pipe Assemblies • Define work against a Tuple Stream
  60. 60. Pipe Assemblies • Define work against a Tuple Stream • May have multiple sources and sinks
  61. 61. Pipe Assemblies • Define work against a Tuple Stream • May have multiple sources and sinks • Splits
  62. 62. Pipe Assemblies • Define work against a Tuple Stream • May have multiple sources and sinks • Splits • Merges
  63. 63. Pipe Assemblies • Define work against a Tuple Stream • May have multiple sources and sinks • Splits • Merges • Joins
  64. 64. Pipe Assemblies
  65. 65. Pipe Assemblies • Pipe
  66. 66. Pipe Assemblies • Pipe • Each • GroupBy • CoGroup • Every • SubAssembly
  67. 67. Pipe Assemblies • Pipe • Each • GroupBy • CoGroup • Every • SubAssembly
  68. 68. Pipe Assemblies • Pipe Applies a • Each Function or Filter • GroupBy Operation to each • CoGroup Tuple • Every • SubAssembly
  69. 69. Pipe Assemblies • Pipe • Each • GroupBy • CoGroup • Every • SubAssembly
  70. 70. Pipe Assemblies • Pipe • Each • GroupBy • CoGroup • Every • SubAssembly
  71. 71. Pipe Assemblies • Pipe • Each Group • GroupBy (& merge) • CoGroup • Every • SubAssembly
  72. 72. Pipe Assemblies • Pipe • Each • GroupBy Joins • CoGroup (inner, outer, left, right) • Every • SubAssembly
  73. 73. Pipe Assemblies • Pipe • Each • GroupBy • CoGroup • Every • SubAssembly
  74. 74. Pipe Assemblies • Pipe • Each • GroupBy • CoGroup • Every Applies an Aggregator (count, sum) to every group of Tuples. • SubAssembly
  75. 75. Pipe Assemblies • Pipe • Each • GroupBy • CoGroup • Every • SubAssembly Nesting
  76. 76. Each vs. Every
  77. 77. Each vs. Every • Each() is for individual Tuples
  78. 78. Each vs. Every • Each() is for individual Tuples • Every() is for groups of Tuples
  79. 79. Each new Each(previousPipe, argumentSelector, operation, outputSelector)
  80. 80. Each new Each(previousPipe, argumentSelector, operation, outputSelector) Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  81. 81. Each new Each(previousPipe, argumentSelector, operation, outputSelector) Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  82. 82. Each new Each(previousPipe, argumentSelector, operation, outputSelector) Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  83. 83. import cascading.tap.Tap; import cascading.tuple.Fields; import java.util.Properties; /** * Wordcount example in Cascading */ public class Main { public static void main( String[] args ) { Scheme inputScheme = new TextLine(new Fields("offset", "line")); Scheme outputScheme = new TextLine(); String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  84. 84. Each new Each(previousPipe, argumentSelector, operation, outputSelector) Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  85. 85. Each new Each(previousPipe, argumentSelector, operation, outputSelector) Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  86. 86. Each new Each(previousPipe, argumentSelector, operation, outputSelector) Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  87. 87. Each new Each(previousPipe, argumentSelector, operation, outputSelector) Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); Fields()
  88. 88. Each offset line 0 the lazy brown fox Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  89. 89. Each offset line 0 the lazy brown fox Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  90. 90. Each X offset 0 line the lazy brown fox Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  91. 91. Each X offset 0 line the lazy brown fox Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); word word word word the lazy brown fox
  92. 92. Operations: Functions
  93. 93. Operations: Functions • Insert()
  94. 94. Operations: Functions • Insert() • RegexParser() / RegexGenerator() / RegexReplace()
  95. 95. Operations: Functions • Insert() • RegexParser() / RegexGenerator() / RegexReplace() • DateParser() / DateFormatter()
  96. 96. Operations: Functions • Insert() • RegexParser() / RegexGenerator() / RegexReplace() • DateParser() / DateFormatter() • XPathGenerator()
  97. 97. Operations: Functions • Insert() • RegexParser() / RegexGenerator() / RegexReplace() • DateParser() / DateFormatter() • XPathGenerator() • Identity()
  98. 98. Operations: Functions • Insert() • RegexParser() / RegexGenerator() / RegexReplace() • DateParser() / DateFormatter() • XPathGenerator() • Identity() • etc...
  99. 99. Operations: Filters
  100. 100. Operations: Filters • RegexFilter()
  101. 101. Operations: Filters • RegexFilter() • FilterNull()
  102. 102. Operations: Filters • RegexFilter() • FilterNull() • And() / Or() / Not() / Xor()
  103. 103. Operations: Filters • RegexFilter() • FilterNull() • And() / Or() / Not() / Xor() • ExpressionFilter()
  104. 104. Operations: Filters • RegexFilter() • FilterNull() • And() / Or() / Not() / Xor() • ExpressionFilter() • etc...
  105. 105. Each new Each(previousPipe, argumentSelector, operation, outputSelector) Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  106. 106. Each Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word"));
  107. 107. String inputPath = args[0]; String outputPath = args[1]; Tap sourceTap = inputPath.matches( "^[^:]+://.*") ? new Hfs(inputScheme, inputPath) : new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  108. 108. new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  109. 109. new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  110. 110. new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  111. 111. Every new Every(previousPipe, argumentSelector, operation, outputSelector)
  112. 112. Every new Every(previousPipe, argumentSelector, operation, outputSelector) new Every(wcPipe, new Count(), new Fields("count", "word"));
  113. 113. new Every(wcPipe, new Count(), new Fields("count", "word"));
  114. 114. Operations: Aggregator
  115. 115. Operations: Aggregator • Average()
  116. 116. Operations: Aggregator • Average() • Count()
  117. 117. Operations: Aggregator • Average() • Count() • First() / Last()
  118. 118. Operations: Aggregator • Average() • Count() • First() / Last() • Min() / Max()
  119. 119. Operations: Aggregator • Average() • Count() • First() / Last() • Min() / Max() • Sum()
  120. 120. Every new Every(previousPipe, argumentSelector, operation, outputSelector)
  121. 121. Every new Every(previousPipe, argumentSelector, operation, outputSelector) new Every(wcPipe, new Count(), new Fields("count", "word"));
  122. 122. Every new Every(previousPipe, argumentSelector, operation, outputSelector) new Every(wcPipe, new Count(), new Fields("count", "word")); new Every(wcPipe, Fields.ALL, new Count(), new Fields("count", "word"));
  123. 123. Field Selection
  124. 124. Predefined Field Sets
  125. 125. Predefined Field Sets Fields.ALL all available fields Fields.GROUP fields used for last grouping Fields.VALUES fields not used for last grouping fields of argument Tuple Fields.ARGS (for Operations) replaces input with Operation result Fields.RESULTS (for Pipes)
  126. 126. Predefined Field Sets Fields.ALL all available fields Fields.GROUP fields used for last grouping Fields.VALUES fields not used for last grouping fields of argument Tuple Fields.ARGS (for Operations) replaces input with Operation result Fields.RESULTS (for Pipes)
  127. 127. Predefined Field Sets Fields.ALL all available fields Fields.GROUP fields used for last grouping Fields.VALUES fields not used for last grouping fields of argument Tuple Fields.ARGS (for Operations) replaces input with Operation result Fields.RESULTS (for Pipes)
  128. 128. Predefined Field Sets Fields.ALL all available fields Fields.GROUP fields used for last grouping Fields.VALUES fields not used for last grouping fields of argument Tuple Fields.ARGS (for Operations) replaces input with Operation result Fields.RESULTS (for Pipes)
  129. 129. Predefined Field Sets Fields.ALL all available fields Fields.GROUP fields used for last grouping Fields.VALUES fields not used for last grouping fields of argument Tuple Fields.ARGS (for Operations) replaces input with Operation result Fields.RESULTS (for Pipes)
  130. 130. Field Selection new Each(previousPipe, argumentSelector, operation, outputSelector) pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("timestamp"));
  131. 131. Field Selection new Each(previousPipe, argumentSelector, operation, outputSelector) Operation Input Tuple = Original Tuple Argument Selector
  132. 132. Field Selection pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
  133. 133. Field Selection Original Tuple visitor_ search_ time page_ id id term stamp number pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
  134. 134. Field Selection Original Tuple Argument Selector visitor_ search_ time page_ id timestamp id term stamp number pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
  135. 135. Field Selection Original Tuple Argument Selector visitor_ search_ time page_ id timestamp id term stamp number pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
  136. 136. Field Selection Original Tuple Argument Selector visitor_ search_ time page_ id timestamp id term stamp number pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id")); Input to DateParser will be: timestamp
  137. 137. Field Selection new Each(previousPipe, argumentSelector, operation, outputSelector) Output Tuple = Original Tuple ⊕ Operation Tuple Output Selector
  138. 138. Field Selection pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
  139. 139. Field Selection Original Tuple visitor_ search_ time page_ id id term stamp number pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
  140. 140. Field Selection Original Tuple id visitor_ id search_ term time stamp page_ number ⊕ pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
  141. 141. Field Selection Original Tuple DateParser Output id visitor_ id search_ term time stamp page_ number ⊕ ts pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
  142. 142. Field Selection Original Tuple DateParser Output Output Selector id visitor_ id search_ term time stamp page_ number ⊕ ts ts search_ term visitor_ id pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
  143. 143. Field Selection Original Tuple DateParser Output Output Selector id visitor_ id search_ term time stamp page_ number ⊕ ts ts search_ term visitor_ id pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
  144. 144. Field Selection Original Tuple DateParser Output Output Selector id visitor_ id search_ term time stamp page_ number ⊕ ts ts search_ term visitor_ id pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id"));
  145. 145. Field Selection Original Tuple DateParser Output Output Selector id visitor_ id search_ term time stamp page_ number ⊕ ts ts search_ term visitor_ id pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id")); Output of Each will be: search_ visitor_ ts term id
  146. 146. Field Selection Original Tuple DateParser Output Output Selector id visitor_ id search_ term time stamp page_ number ⊕ ts ts search_ term visitor_ id pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id")); Output of Each will be: search_ visitor_ ts term id
  147. 147. Field Selection Original Tuple DateParser Output Output Selector id visitor_ id search_ term time stamp page_ number ⊕ ts ts search_ term visitor_ id pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id")); Output of Each will be: search_ visitor_ ts term id
  148. 148. Field Selection Original Tuple DateParser Output Output Selector id visitor_ id search_ term time stamp page_ number ⊕ ts ts search_ term visitor_ id pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id")); Output of Each will be: search_ visitor_ ts term id
  149. 149. Field Selection Original Tuple DateParser Output Output Selector X id visitor_ id search_ term XX ⊕ time stamp page_ number ts ts search_ term visitor_ id pipe = new Each(pipe, new Fields("timestamp"), new DateParser("yyyy-MM-dd HH:mm:ss"), new Fields("ts", "search_term", "visitor_id")); Output of Each will be: search_ visitor_ ts term id
  150. 150. word hello word hello word world new Every(wcPipe, new Count(), new Fields("count", "word"));
  151. 151. word hello word hello word world new Every(wcPipe, new Count(), new Fields("count", "word")); count word 2 hello count word 1 world
  152. 152. new Lfs(inputScheme, inputPath); Tap sinkTap = outputPath.matches("^[^:]+://.*") ? new Hfs(outputScheme, outputPath) : new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  153. 153. new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  154. 154. new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  155. 155. new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  156. 156. new Lfs(outputScheme, outputPath); Pipe wcPipe = new Each("wordcount", new Fields("line"), new RegexSplitGenerator(new Fields("word"), "s+"), new Fields("word")); wcPipe = new GroupBy(wcPipe, new Fields("word")); wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word")); Properties properties = new Properties(); FlowConnector.setApplicationJarClass(properties, Main.class); Flow parsedLogFlow = new FlowConnector(properties) .connect(sourceTap, sinkTap, wcPipe); parsedLogFlow.start(); parsedLogFlow.complete(); } }
  157. 157. Run
  158. 158. input: mary had a little lamb little lamb little lamb mary had a little lamb whose fleece was white as snow
  159. 159. input: mary had a little lamb little lamb little lamb mary had a little lamb whose fleece was white as snow command: hadoop jar ./target/cascading-tutorial-1.0.0.jar data/sources/misc/mary.txt data/output
  160. 160. output: 2 a 1 as 1 fleece 2 had 4 lamb 4 little 2 mary 1 snow 1 was 1 white 1 whose
What’s next?
apply operations to tuple streams. repeat.
Cookbook
Cookbook

// remove all tuples where search_term is '-'
pipe = new Each(pipe,
                new Fields("search_term"),
                new RegexFilter("^(-)$", true));
Cookbook

// convert "timestamp" String field to "ts" long field
pipe = new Each(pipe,
                new Fields("timestamp"),
                new DateParser("yyyy-MM-dd HH:mm:ss"),
                Fields.ALL);
Cookbook

// td - timestamp of day-level granularity
pipe = new Each(pipe,
                new Fields("ts"),
                new ExpressionFunction(
                    new Fields("td"),
                    "ts - (ts % (24 * 60 * 60 * 1000))",
                    long.class),
                Fields.ALL);
Cookbook

// SELECT DISTINCT visitor_id
// GROUP BY visitor_id ORDER BY ts
pipe = new GroupBy(pipe,
                   new Fields("visitor_id"),  // grouping fields
                   new Fields("ts"));         // secondary sort fields

// take the First() tuple of every grouping
pipe = new Every(pipe, Fields.ALL, new First(), Fields.RESULTS);
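Chained together, the cookbook snippets above form a single pipe assembly. A minimal sketch, assuming a source tap that provides visitor_id, search_term, and timestamp fields:

Pipe pipe = new Pipe("sessions");

// drop empty ('-') search terms
pipe = new Each(pipe, new Fields("search_term"),
                new RegexFilter("^(-)$", true));

// parse "timestamp" into a long "ts" field
pipe = new Each(pipe, new Fields("timestamp"),
                new DateParser("yyyy-MM-dd HH:mm:ss"), Fields.ALL);

// first tuple per visitor, ordered by ts
pipe = new GroupBy(pipe, new Fields("visitor_id"), new Fields("ts"));
pipe = new Every(pipe, Fields.ALL, new First(), Fields.RESULTS);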
Custom Operations
Desired Operation

Pipe pipe = new Each("ngram pipe",
                     new Fields("line"),
                     new NGramTokenizer(2),
                     new Fields("ngram"));

Input:

  line: mary had a little lamb

Output (one tuple per 2-gram):

  ngram: mary had
  ngram: had a
  ngram: a little
  ngram: little lamb
Custom Operations

• extend BaseOperation
• implement Function (or Filter)
• implement operate
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

public class NGramTokenizer extends BaseOperation implements Function
  {
  private int size;

  public NGramTokenizer( int size )
    {
    // declare the single output field this Function emits
    super( new Fields( "ngram" ) );
    this.size = size;
    }

  public void operate( FlowProcess flowProcess, FunctionCall functionCall )
    {
    String token;
    TupleEntry arguments = functionCall.getArguments();

    // wrap the incoming line in a (non-Cascading) n-gram tokenizer
    com.attinteractive.util.NGramTokenizer t =
      new com.attinteractive.util.NGramTokenizer( arguments.getString( 0 ), this.size );

    // emit one result tuple per n-gram
    while( ( token = t.next() ) != null )
      {
      TupleEntry result = new TupleEntry( new Fields( "ngram" ), new Tuple( token ) );
      functionCall.getOutputCollector().add( result );
      }
    }
  }
Tuple

  123 | vis-abc | tacos | 2009-10-08 03:21:23 | 1

  zero-indexed, ordered values

Fields

  id | visitor_id | search_term | timestamp | page_number

  list of strings representing field names

TupleEntry

  id  | visitor_id | search_term | timestamp           | page_number
  123 | vis-abc    | tacos       | 2009-10-08 03:21:23 | 1

  pairs Fields with a Tuple, giving access to values by field name
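A minimal sketch of these three classes in use, with the values from the slides above (TupleEntryDemo is just an illustrative class name, not part of the tutorial code):

import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

public class TupleEntryDemo
  {
  public static void main( String[] args )
    {
    // field names and positional values from the slides above
    Fields fields = new Fields( "id", "visitor_id", "search_term", "timestamp", "page_number" );
    Tuple values = new Tuple( 123, "vis-abc", "tacos", "2009-10-08 03:21:23", 1 );

    // a TupleEntry pairs the two, so values can be read by name or by position
    TupleEntry entry = new TupleEntry( fields, values );

    System.out.println( entry.getString( "search_term" ) ); // tacos (by field name)
    System.out.println( values.getString( 2 ) );            // tacos (by position)
    }
  }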
Success

  line: mary had a little lamb

Pipe pipe = new Each("ngram pipe",
                     new Fields("line"),
                     new NGramTokenizer(2),
                     new Fields("ngram"));

  ngram: mary had
  ngram: had a
  ngram: a little
  ngram: little lamb
code examples
dot

Flow importFlow = flowConnector.connect(sourceTap, tmpIntermediateTap, importPipe);
importFlow.writeDOT( "import-flow.dot" );
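writeDOT emits a Graphviz description of the flow's graph; if Graphviz is installed, it can be rendered with, for example:

  dot -Tpng import-flow.dot -o import-flow.png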
What next?

• http://www.cascading.org/
• http://bit.ly/cascading
Questions? <nmurray@att.com>