Published on

Published in: Technology, Business
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. Cascading Wednesday, May 14, 2008
  2. 2. Design Goals Make large processing jobs more transparent Reusable processing components independent of resources Incremental “data” builds Simplify testing of processes Scriptable from higher level languages (Groovy, JRuby, Jython, etc) Wednesday, May 14, 2008
  3. 3. Cascading Introduction Wednesday, May 14, 2008
  4. 4. Tuple Streams Value Stream Group Stream [K1,K2,...,Kn [V1,V2,...,Vn [V1,V2,...,Vn Tuple [V1,V2,...,Vn [V1,V2,...,Vn A set of ordered data [“John”, “Doe”, 39] [V1,V2,...,Vn [V1,V2,...,Vn [K1,K2,...,Kn [V1,V2,...,Vn Value Stream [V1,V2,...,Vn [V1,V2,...,Vn [V1,V2,...,Vn Just tuples [V1,V2,...,Vn [V1,V2,...,Vn [V1,V2,...,Vn [V1,V2,...,Vn Group Stream Tuples groups by a key Wednesday, May 14, 2008
  5. 5. Tuple Streams [values] Scalar functions and filters Source Apply to value and group streams [values] func [values] Aggregate functions Apply to group stream [values] [groups/values] Group Functions can be chained [groups] [values] aggr func [values] Sink Source func Group aggr Sink Wednesday, May 14, 2008
  6. 6. Stream Processing Flow Pipe Assembly S F F G A A S Pipe Assemblies A chain of scalar functions, groupings, aggregate functions Reusable, independent of data source/sink Flows Cascade S F S F S Assemblies plus sources and sinks S F Cascades S F S A collection of Flows S F Wednesday, May 14, 2008
  7. 7. Processing Patterns Source Group Sink Source Sink Chain Group Sink Sink Source Source Splits Group Sink Sink Source Joins Group Sink Source Cross Source Group Sink Wednesday, May 14, 2008
  8. 8. MapReduce Planner Flow Job Flow Job Map Map Reduce F F F Reduce S F G A S G A S F Map Job Map S F F Job Reduce Map Flows are logical ‘units of work’ G A S F S S F Map Flows ‘compiled’ into MR Jobs Intermediate files are created (and destroyed) to join Jobs Wednesday, May 14, 2008
  9. 9. Topological Scheduler Flows walk MapReduce Jobs in dependency order Cascades walk Flows in dependency order Independent Jobs and Flows are scheduled to run concurrently Listeners can react to element events (notify completion or failures) Only stale data-sets are rebuilt (configurable) Wednesday, May 14, 2008
  10. 10. Scripting - Groovy Flow flow = builder.flow("wordcount") { source(input, scheme: text()) // input is filename of raw text document tokenize(/[.,]*s+/) // output new tuple for each split, result replaces stream by default group() // group on stream count() // count values in group, creates 'count' field by default group(["count"], reverse: true) // group/sort on 'count', reverse the sort order sink(output) } flow.complete() // execute, block till completed Wednesday, May 14, 2008
  11. 11. System Integration FileSystems (unique to Cascading) Raw file S3 reading/writing (MD5) Raw file HTTP reading (MD5) Zip files Can bypass native Hadoop ‘collectors’ Event notification via listeners (XMPP/SQS/Zookeeper notifications) Groovy scripting for easier local shell/file operations (wget, scp, etc) Wednesday, May 14, 2008
  12. 12. Cascading API & Internals Wednesday, May 14, 2008
  13. 13. Core Concepts Taps and Schemes Tuples and Fields Pipes and PipeAssemblies Each and Every Operators Groups Flows, FlowSteps, and FlowConnectors Cascades, and CascadeConnectors, optional Wednesday, May 14, 2008
  14. 14. Taps and Schemes Taps, abstract out where and how a data resources is accessed hdfs, http, local, S3, etc Taps, used as Tuple (data) stream sinks, sources, or both Schemes, define what a resource is made of text lines, SequenceFile, CSV, etc Wednesday, May 14, 2008
  15. 15. Tuples and Fields Tuples are the ‘records’, read from Tap sources, written to Tap sinks Fields are the ‘column names’, sourced from Schemes Tuple class, an ordered collection of Comparable values (“a string”, 1.0, new SomeComparableWritable()) Fields class, a list of field names, absolute or relative positions (“total”, 3, -1) // fields ‘total’, 4th position, last position Wednesday, May 14, 2008
  16. 16. Pipes and PipeAssemblies Tuple streams pass through Pipes to be processed Pipes, apply functions, filters, and aggregators to the Tuple stream Pipe instances are chained together into assemblies Reusable assemblies are subclasses of class PipeAssembly A B' A E P C' E B' G A B A E E C C' E E Wednesday, May 14, 2008
  17. 17. Group Class and Subclasses Group, subclass of Pipe, groups the Tuple stream on given fields GroupBy and CoGroup subclass Group GroupBy groups and sorts CoGroup performs joins T E Fe Fa G A T T E G A T T E Wednesday, May 14, 2008
  18. 18. Each and Every Classes Each, subclass of Pipe, applies Functions and Filters to each Tuple instance (a,b,c) -> Each( func() ) -> (a,b,c,d) Every, subclass of Pipe, applies Aggregators to every Tuple group (a: b,c) -> Every( agg()) -> (a,d: b,c) Fe Fa E A Wednesday, May 14, 2008
  19. 19. Flows and FlowConnectors Flows encapsulate assemblies and sink and source Taps FlowConnectors connect assemblies and Taps into Flows Flow FlowStep Flow FlowStep E G A T T E T E G A T FlowStep T E E E G A T Wednesday, May 14, 2008
  20. 20. FlowSteps and FlowConnectors Internally, FlowConnectors ‘compile’ assemblies into FlowSteps FlowSteps are MapReduce jobs, which are executed in Topo order Temporary files are created to link FlowSteps Flow FlowStep FlowStep Map Stack Reduce Stack Map Reduce Stack Stack T E G A T G E T Wednesday, May 14, 2008
  21. 21. Cascades and CascadeConnectors Are optional Cascades bind Flows together via shared Taps CascadeConnectors connect Flows Flows are executed in Topo order Cascade T F T F T T F T E T F T Wednesday, May 14, 2008
  22. 22. Syntax Each( previous, argSelector, function/filter, resultSelector ) Every( previous, argSelector, aggregator, resultSelector ) GroupBy( previous, groupSelector, sortSelector ) CoGroup( joinN, joiner, declaredFields ) Function( numArgs, declaredFields, .... ) Filter (numArgs, ... ) Aggregator( numArgs, declaredFields, ... ) Wednesday, May 14, 2008