The Cascading (big) data application framework 
André Kelpe | HUG France | Paris | 25 November 2014
Who am I? 
André Kelpe 
Senior Software Engineer at Concurrent 
company behind Cascading, Lingual and 
Driven 
http://concurrentinc.com / @concurrent 
andre@concurrentinc.com / @fs111
http://cascading.org 
Apache-licensed Java framework for writing data-oriented 
applications 
production ready, stable and battle proven 
(SoundCloud, Twitter, Etsy, Climate Corp + many 
more)
Cascading goals 
developer productivity 
focus on business problems, not distributed-systems 
knowledge 
useful abstractions over the underlying "fabrics"
Cascading goals 
Testability & robustness 
production-quality applications rather than a 
collection of scripts 
(hooks into the core for experts)
https://www.flickr.com/photos/theilr/4283377543/sizes/l
Cascading terminology 
Taps are sources and sinks for data 
Schemes represent the format of the data 
Pipes connect Taps
Cascading terminology 
● Tuples flow through Pipes 
● Fields describe the Tuples 
● Operations are executed on Tuples in 
TupleStreams 
● The FlowConnector uses the QueryPlanner to 
translate a FlowDef into a Flow that runs on the 
computational fabric
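As a plain-Java analogy (illustration only, not the Cascading API): a Tuple can be pictured as an ordered list of values whose positions are named by Fields, so Operations can address values by field name instead of by index. All names below are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: Fields map names to tuple positions.
public class TupleSketch {
    // The "Fields" of this stream: each name labels one tuple position.
    static final List<String> FIELDS = Arrays.asList("token", "count");

    // Resolve a value by field name, the way Fields resolve positions.
    static Object get(List<Object> tuple, String field) {
        return tuple.get(FIELDS.indexOf(field));
    }

    public static void main(String[] args) {
        List<Object> tuple = Arrays.<Object>asList("rain", 2);
        System.out.println(get(tuple, "count")); // → 2
    }
}
```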
Compiler analogy 
[Diagram: the QueryPlanner acts like a compiler. FlowDefs (user code) go through translation, optimization and assembly into Flows that target a computational fabric (Hadoop, Tez, Spark), just as a compiler turns source code into instructions for a CPU architecture.]
User-APIs 
● Fluid - A Fluent API for Cascading 
– Targeted at application writers 
– https://github.com/Cascading/fluid 
● "Raw" Cascading API 
– Targeted at library writers, code generators and 
integration layers 
– https://github.com/Cascading/cascading
Counting words 
// configuration 
String docPath = args[ 0 ]; 
String wcPath = args[ 1 ]; 
Properties properties = new Properties(); 
AppProps.setApplicationJarClass( properties, Main.class ); 
FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties ); 
// create source and sink taps 
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath ); 
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath ); 
...
Counting words (cont.) 
// specify a regex operation to split the "document" text lines into a 
token stream 
Fields token = new Fields( "token" ); 
Fields text = new Fields( "text" ); 
RegexSplitGenerator splitter = 
new RegexSplitGenerator( token, "[ \\[\\](),.]" ); 
// only returns "token" 
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); 
// determine the word counts 
Pipe wcPipe = new Pipe( "wc", docPipe ); 
wcPipe = new GroupBy( wcPipe, token ); 
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); 
...
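What the pipeline above computes can be mimicked in a few lines of plain Java (illustration of the data flow only, not the Cascading API): split each line on the same delimiter class the RegexSplitGenerator uses, then group identical tokens and count them, as GroupBy + Every(Count) do.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of the word-count data flow.
public class WordCountSketch {
    // Same character class as the RegexSplitGenerator in the slides.
    static final String DELIMITERS = "[ \\[\\](),.]";

    static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String token : line.split(DELIMITERS)) {
                if (token.isEmpty()) continue;     // skip empty splits
                counts.merge(token, 1, Integer::sum); // "GroupBy" + "Count"
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("rain, rain (go away)")); // → {rain=2, go=1, away=1}
    }
}
```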
Counting words (cont.) 
// connect the taps, pipes, etc., into a flow 
FlowDef flowDef = FlowDef.flowDef() 
.setName( "wc" ) 
.addSource( docPipe, docTap ) 
.addTailSink( wcPipe, wcTap ); 
Flow wcFlow = flowConnector.connect( flowDef ); 
wcFlow.complete(); // ← runs the code 
}
https://driven.cascading.io/driven/871A2C66DA1D4841B229CDD2B04B9FDA
Impatient 
Cascading for the Impatient 
http://docs.cascading.org/impatient/index.html
A full toolbox 
● Operations 
– Function 
– Filter 
– Regex/Scripts 
– Boolean operators 
– Count/Limit/Last/First 
– Unique 
– Asserts 
– Min/Max 
– … 
● Splices 
– GroupBy 
– CoGroup 
– HashJoin 
– Merge 
● Joins 
– left, right, outer, inner, mixed...
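Conceptually, a HashJoin builds an in-memory hash table from the smaller side of the join and streams the larger side through it. A plain-Java sketch of inner-join semantics (names here are illustrative, not the Cascading API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Conceptual sketch of HashJoin: hash the small (right) side,
// then probe it once per row on the streamed (left) side.
public class HashJoinSketch {
    static List<String> innerJoin(Map<String, String> rhs, List<String[]> lhs) {
        List<String> joined = new ArrayList<>();
        for (String[] row : lhs) {              // row = {key, value}
            String match = rhs.get(row[0]);     // probe the hash table
            if (match != null)                  // inner join: drop non-matches
                joined.add(row[0] + ":" + row[1] + ":" + match);
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> countries = Map.of("fr", "France");
        List<String[]> cities = Arrays.asList(
            new String[]{"fr", "Paris"}, new String[]{"de", "Berlin"});
        System.out.println(innerJoin(countries, cities)); // → [fr:Paris:France]
    }
}
```

This mirrors why HashJoin is the right Splice when one side fits in memory, while CoGroup handles two large sides via sorting and grouping.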
A full toolbox 
data access: JDBC, HBase, Elasticsearch, 
Redshift, HDFS, S3, Cassandra... 
data formats: Avro, Thrift, Protobuf, CSV, TSV... 
integration points: Cascading Lingual (SQL), 
Apache Hive, classical M/R apps... 
not Java?: Scalding (Scala), Cascalog (Clojure)
Status quo 
● Cascading 2.6 
– Production release 
● Hadoop 2.x 
● Hadoop 1.x 
● Local mode 
● Cascading 3.0 
– public WIP builds 
● Tez 
● Hadoop 2.x 
● Hadoop 1.x 
● Local mode 
● Others (Spark...)
Questions? 
andre@concurrentinc.com
Link Collection 
http://www.cascading.org/ 
https://github.com/Cascading/ 
http://concurrentinc.com 
http://cascading.io/driven/ 
https://groups.google.com/forum/#!forum/cascading-user 
http://docs.cascading.org/impatient/ 
http://docs.cascading.org/cascading/2.6/userguide/html/
fin.
