0
Cascading

                           www.cascading.org
                           info@cascading.org
Wednesday, May 14, 2...
Design Goals
                   Make large processing jobs more transparent
                   Reusable processing compone...
Cascading Introduction




Wednesday, May 14, 2008
Tuple Streams
                                                                  Value Stream       Group Stream
          ...
Tuple Streams
                                                                                             [values]

     ...
Stream Processing
                                                        Flow
                                           ...
Processing Patterns
                                       Source   Group    Sink   Source   Sink




                   C...
MapReduce Planner
                 Flow     Job
                                                               Flow       ...
Topological Scheduler

                   Flows walk MapReduce Jobs in dependency order
                   Cascades walk F...
Scripting - Groovy
      Flow flow = builder.flow("wordcount")
       {
         source(input, scheme: text()) // input is fi...
System Integration
                   FileSystems (unique to Cascading)
                      Raw file S3 reading/writing (...
Cascading API & Internals




Wednesday, May 14, 2008
Core Concepts
                     Taps and Schemes

                     Tuples and Fields

                     Pipes an...
Taps and Schemes

                   Taps, abstract out where and how a data resources is accessed
                       ...
Tuples and Fields

                   Tuples are the ‘records’, read from Tap sources, written to Tap sinks
              ...
Pipes and PipeAssemblies
                   Tuple streams pass through Pipes to be processed
                   Pipes, app...
Group Class and Subclasses
                   Group, subclass of Pipe, groups the Tuple stream on given fields
            ...
Each and Every Classes
                   Each, subclass of Pipe, applies Functions and Filters to each Tuple instance
   ...
Flows and FlowConnectors

                   Flows encapsulate assemblies and sink and source Taps
                   Flow...
FlowSteps and FlowConnectors
                   Internally, FlowConnectors ‘compile’ assemblies into FlowSteps
           ...
Cascades and CascadeConnectors
                   Are optional
                   Cascades bind Flows together via shared ...
Syntax
                   Each( previous, argSelector, function/filter, resultSelector )
                   Every( previous...
Upcoming SlideShare
Loading in...5
×

Cascading[1]

1,252

Published on

Published in: Technology, Business
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,252
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
46
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Transcript of "Cascading[1]"

  1. 1. Cascading www.cascading.org info@cascading.org Wednesday, May 14, 2008
  2. 2. Design Goals Make large processing jobs more transparent Reusable processing components independent of resources Incremental “data” builds Simplify testing of processes Scriptable from higher level languages (Groovy, JRuby, Jython, etc) Wednesday, May 14, 2008
  3. 3. Cascading Introduction Wednesday, May 14, 2008
  4. 4. Tuple Streams Value Stream Group Stream [K1,K2,...,Kn [V1,V2,...,Vn [V1,V2,...,Vn Tuple [V1,V2,...,Vn [V1,V2,...,Vn A set of ordered data [“John”, “Doe”, 39] [V1,V2,...,Vn [V1,V2,...,Vn [K1,K2,...,Kn [V1,V2,...,Vn Value Stream [V1,V2,...,Vn [V1,V2,...,Vn [V1,V2,...,Vn Just tuples [V1,V2,...,Vn [V1,V2,...,Vn [V1,V2,...,Vn [V1,V2,...,Vn Group Stream Tuples groups by a key Wednesday, May 14, 2008
  5. 5. Tuple Streams [values] Scalar functions and filters Source Apply to value and group streams [values] func [values] Aggregate functions Apply to group stream [values] [groups/values] Group Functions can be chained [groups] [values] aggr func [values] Sink Source func Group aggr Sink Wednesday, May 14, 2008
  6. 6. Stream Processing Flow Pipe Assembly S F F G A A S Pipe Assemblies A chain of scalar functions, groupings, aggregate functions Reusable, independent of data source/sink Flows Cascade S F S F S Assemblies plus sources and sinks S F Cascades S F S A collection of Flows S F Wednesday, May 14, 2008
  7. 7. Processing Patterns Source Group Sink Source Sink Chain Group Sink Sink Source Source Splits Group Sink Sink Source Joins Group Sink Source Cross Source Group Sink Wednesday, May 14, 2008
  8. 8. MapReduce Planner Flow Job Flow Job Map Map Reduce F F F Reduce S F G A S G A S F Map Job Map S F F Job Reduce Map Flows are logical ‘units of work’ G A S F S S F Map Flows ‘compiled’ into MR Jobs Intermediate files are created (and destroyed) to join Jobs Wednesday, May 14, 2008
  9. 9. Topological Scheduler Flows walk MapReduce Jobs in dependency order Cascades walk Flows in dependency order Independent Jobs and Flows are scheduled to run concurrently Listeners can react to element events (notify completion or failures) Only stale data-sets are rebuilt (configurable) Wednesday, May 14, 2008
  10. 10. Scripting - Groovy Flow flow = builder.flow("wordcount") { source(input, scheme: text()) // input is filename of raw text document tokenize(/[.,]*s+/) // output new tuple for each split, result replaces stream by default group() // group on stream count() // count values in group, creates 'count' field by default group(["count"], reverse: true) // group/sort on 'count', reverse the sort order sink(output) } flow.complete() // execute, block till completed Wednesday, May 14, 2008
  11. 11. System Integration FileSystems (unique to Cascading) Raw file S3 reading/writing (MD5) Raw file HTTP reading (MD5) Zip files Can bypass native Hadoop ‘collectors’ Event notification via listeners (XMPP/SQS/Zookeeper notifications) Groovy scripting for easier local shell/file operations (wget, scp, etc) Wednesday, May 14, 2008
  12. 12. Cascading API & Internals Wednesday, May 14, 2008
  13. 13. Core Concepts Taps and Schemes Tuples and Fields Pipes and PipeAssemblies Each and Every Operators Groups Flows, FlowSteps, and FlowConnectors Cascades, and CascadeConnectors, optional Wednesday, May 14, 2008
  14. 14. Taps and Schemes Taps, abstract out where and how a data resources is accessed hdfs, http, local, S3, etc Taps, used as Tuple (data) stream sinks, sources, or both Schemes, define what a resource is made of text lines, SequenceFile, CSV, etc Wednesday, May 14, 2008
  15. 15. Tuples and Fields Tuples are the ‘records’, read from Tap sources, written to Tap sinks Fields are the ‘column names’, sourced from Schemes Tuple class, an ordered collection of Comparable values (“a string”, 1.0, new SomeComparableWritable()) Fields class, a list of field names, absolute or relative positions (“total”, 3, -1) // fields ‘total’, 4th position, last position Wednesday, May 14, 2008
  16. 16. Pipes and PipeAssemblies Tuple streams pass through Pipes to be processed Pipes, apply functions, filters, and aggregators to the Tuple stream Pipe instances are chained together into assemblies Reusable assemblies are subclasses of class PipeAssembly A B' A E P C' E B' G A B A E E C C' E E Wednesday, May 14, 2008
  17. 17. Group Class and Subclasses Group, subclass of Pipe, groups the Tuple stream on given fields GroupBy and CoGroup subclass Group GroupBy groups and sorts CoGroup performs joins T E Fe Fa G A T T E G A T T E Wednesday, May 14, 2008
  18. 18. Each and Every Classes Each, subclass of Pipe, applies Functions and Filters to each Tuple instance (a,b,c) -> Each( func() ) -> (a,b,c,d) Every, subclass of Pipe, applies Aggregators to every Tuple group (a: b,c) -> Every( agg()) -> (a,d: b,c) Fe Fa E A Wednesday, May 14, 2008
  19. 19. Flows and FlowConnectors Flows encapsulate assemblies and sink and source Taps FlowConnectors connect assemblies and Taps into Flows Flow FlowStep Flow FlowStep E G A T T E T E G A T FlowStep T E E E G A T Wednesday, May 14, 2008
  20. 20. FlowSteps and FlowConnectors Internally, FlowConnectors ‘compile’ assemblies into FlowSteps FlowSteps are MapReduce jobs, which are executed in Topo order Temporary files are created to link FlowSteps Flow FlowStep FlowStep Map Stack Reduce Stack Map Reduce Stack Stack T E G A T G E T Wednesday, May 14, 2008
  21. 21. Cascades and CascadeConnectors Are optional Cascades bind Flows together via shared Taps CascadeConnectors connect Flows Flows are executed in Topo order Cascade T F T F T T F T E T F T Wednesday, May 14, 2008
  22. 22. Syntax Each( previous, argSelector, function/filter, resultSelector ) Every( previous, argSelector, aggregator, resultSelector ) GroupBy( previous, groupSelector, sortSelector ) CoGroup( joinN, joiner, declaredFields ) Function( numArgs, declaredFields, .... ) Filter (numArgs, ... ) Aggregator( numArgs, declaredFields, ... ) Wednesday, May 14, 2008
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×