Your SlideShare is downloading. ×
Cascading[1]
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Cascading[1]

1,219
views

Published on

Published in: Technology, Business

0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,219
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
43
Comments
0
Likes
3
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Cascading www.cascading.org info@cascading.org Wednesday, May 14, 2008
  • 2. Design Goals Make large processing jobs more transparent Reusable processing components independent of resources Incremental “data” builds Simplify testing of processes Scriptable from higher level languages (Groovy, JRuby, Jython, etc) Wednesday, May 14, 2008
  • 3. Cascading Introduction Wednesday, May 14, 2008
  • 4. Tuple Streams Value Stream Group Stream [K1,K2,...,Kn [V1,V2,...,Vn [V1,V2,...,Vn Tuple [V1,V2,...,Vn [V1,V2,...,Vn A set of ordered data [“John”, “Doe”, 39] [V1,V2,...,Vn [V1,V2,...,Vn [K1,K2,...,Kn [V1,V2,...,Vn Value Stream [V1,V2,...,Vn [V1,V2,...,Vn [V1,V2,...,Vn Just tuples [V1,V2,...,Vn [V1,V2,...,Vn [V1,V2,...,Vn [V1,V2,...,Vn Group Stream Tuples groups by a key Wednesday, May 14, 2008
  • 5. Tuple Streams [values] Scalar functions and filters Source Apply to value and group streams [values] func [values] Aggregate functions Apply to group stream [values] [groups/values] Group Functions can be chained [groups] [values] aggr func [values] Sink Source func Group aggr Sink Wednesday, May 14, 2008
  • 6. Stream Processing Flow Pipe Assembly S F F G A A S Pipe Assemblies A chain of scalar functions, groupings, aggregate functions Reusable, independent of data source/sink Flows Cascade S F S F S Assemblies plus sources and sinks S F Cascades S F S A collection of Flows S F Wednesday, May 14, 2008
  • 7. Processing Patterns Source Group Sink Source Sink Chain Group Sink Sink Source Source Splits Group Sink Sink Source Joins Group Sink Source Cross Source Group Sink Wednesday, May 14, 2008
  • 8. MapReduce Planner Flow Job Flow Job Map Map Reduce F F F Reduce S F G A S G A S F Map Job Map S F F Job Reduce Map Flows are logical ‘units of work’ G A S F S S F Map Flows ‘compiled’ into MR Jobs Intermediate files are created (and destroyed) to join Jobs Wednesday, May 14, 2008
  • 9. Topological Scheduler Flows walk MapReduce Jobs in dependency order Cascades walk Flows in dependency order Independent Jobs and Flows are scheduled to run concurrently Listeners can react to element events (notify completion or failures) Only stale data-sets are rebuilt (configurable) Wednesday, May 14, 2008
  • 10. Scripting - Groovy Flow flow = builder.flow("wordcount") { source(input, scheme: text()) // input is filename of raw text document tokenize(/[.,]*s+/) // output new tuple for each split, result replaces stream by default group() // group on stream count() // count values in group, creates 'count' field by default group(["count"], reverse: true) // group/sort on 'count', reverse the sort order sink(output) } flow.complete() // execute, block till completed Wednesday, May 14, 2008
  • 11. System Integration FileSystems (unique to Cascading) Raw file S3 reading/writing (MD5) Raw file HTTP reading (MD5) Zip files Can bypass native Hadoop ‘collectors’ Event notification via listeners (XMPP/SQS/Zookeeper notifications) Groovy scripting for easier local shell/file operations (wget, scp, etc) Wednesday, May 14, 2008
  • 12. Cascading API & Internals Wednesday, May 14, 2008
  • 13. Core Concepts Taps and Schemes Tuples and Fields Pipes and PipeAssemblies Each and Every Operators Groups Flows, FlowSteps, and FlowConnectors Cascades, and CascadeConnectors, optional Wednesday, May 14, 2008
  • 14. Taps and Schemes Taps, abstract out where and how a data resources is accessed hdfs, http, local, S3, etc Taps, used as Tuple (data) stream sinks, sources, or both Schemes, define what a resource is made of text lines, SequenceFile, CSV, etc Wednesday, May 14, 2008
  • 15. Tuples and Fields Tuples are the ‘records’, read from Tap sources, written to Tap sinks Fields are the ‘column names’, sourced from Schemes Tuple class, an ordered collection of Comparable values (“a string”, 1.0, new SomeComparableWritable()) Fields class, a list of field names, absolute or relative positions (“total”, 3, -1) // fields ‘total’, 4th position, last position Wednesday, May 14, 2008
  • 16. Pipes and PipeAssemblies Tuple streams pass through Pipes to be processed Pipes, apply functions, filters, and aggregators to the Tuple stream Pipe instances are chained together into assemblies Reusable assemblies are subclasses of class PipeAssembly A B' A E P C' E B' G A B A E E C C' E E Wednesday, May 14, 2008
  • 17. Group Class and Subclasses Group, subclass of Pipe, groups the Tuple stream on given fields GroupBy and CoGroup subclass Group GroupBy groups and sorts CoGroup performs joins T E Fe Fa G A T T E G A T T E Wednesday, May 14, 2008
  • 18. Each and Every Classes Each, subclass of Pipe, applies Functions and Filters to each Tuple instance (a,b,c) -> Each( func() ) -> (a,b,c,d) Every, subclass of Pipe, applies Aggregators to every Tuple group (a: b,c) -> Every( agg()) -> (a,d: b,c) Fe Fa E A Wednesday, May 14, 2008
  • 19. Flows and FlowConnectors Flows encapsulate assemblies and sink and source Taps FlowConnectors connect assemblies and Taps into Flows Flow FlowStep Flow FlowStep E G A T T E T E G A T FlowStep T E E E G A T Wednesday, May 14, 2008
  • 20. FlowSteps and FlowConnectors Internally, FlowConnectors ‘compile’ assemblies into FlowSteps FlowSteps are MapReduce jobs, which are executed in Topo order Temporary files are created to link FlowSteps Flow FlowStep FlowStep Map Stack Reduce Stack Map Reduce Stack Stack T E G A T G E T Wednesday, May 14, 2008
  • 21. Cascades and CascadeConnectors Are optional Cascades bind Flows together via shared Taps CascadeConnectors connect Flows Flows are executed in Topo order Cascade T F T F T T F T E T F T Wednesday, May 14, 2008
  • 22. Syntax Each( previous, argSelector, function/filter, resultSelector ) Every( previous, argSelector, aggregator, resultSelector ) GroupBy( previous, groupSelector, sortSelector ) CoGroup( joinN, joiner, declaredFields ) Function( numArgs, declaredFields, .... ) Filter (numArgs, ... ) Aggregator( numArgs, declaredFields, ... ) Wednesday, May 14, 2008

×