2. What is Cascading?
“Cascading is the proven application development
platform for building data applications on Hadoop.”
(www.cascading.org)
Java API for large-scale batch processing
Programs are specified as data flows
• pipes, taps, flow, cascade, …
• each, groupBy, every, coGroup, merge, …
Open Source (AL2)
• Developed by Concurrent
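A minimal word-count flow shows how these pieces fit together. This is an untested sketch: the taps, pipes, and operator classes follow the Cascading Java API, and the input/output paths are placeholders.

```java
// Taps define where data is read from and written to
Tap docs   = new Hfs(new TextLine(), "input/docs");
Tap counts = new Hfs(new TextDelimited(), "output/counts", SinkMode.REPLACE);

// Pipes define the data flow: tokenize, group, count
Pipe pipe = new Pipe("wordcount");
pipe = new Each(pipe, new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"));
pipe = new GroupBy(pipe, new Fields("word"));
pipe = new Every(pipe, new Count(new Fields("count")));

// A FlowConnector plans and connects the flow to its taps
Flow flow = new Hadoop2MR1FlowConnector().connect(docs, counts, pipe);
flow.complete();
```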
3. Cascading on MapReduce
Originally for Hadoop MapReduce
Much better API than MapReduce
• DAG programming model
• Higher-level operators (join, coGroup, merge)
• Composable and reusable code
Automatic translation to MapReduce jobs
• Minimizes number of MapReduce jobs
Rock-solid execution due to Hadoop MapReduce
4. Cascading Example
Compute TF-IDF scores for a set of documents
• TF-IDF: Term Frequency × Inverse Document Frequency
• Used for weighting the relevance of terms in search engines
Building this against the MapReduce API is painful!
Example taken from docs.cascading.org/impatient
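The score itself is easy to state; the pain is in expressing the dataflow. For reference, a plain-Java version of the standard TF-IDF weighting (tf × idf); the tutorial's exact smoothing of idf may differ:

```java
public class TfIdf {

    // tf  = occurrences of the term in the document / terms in the document
    // idf = log(total documents / documents containing the term)
    public static double tfIdf(long termCount, long termsInDoc,
                               long totalDocs, long docsWithTerm) {
        double tf  = (double) termCount / termsInDoc;
        double idf = Math.log((double) totalDocs / docsWithTerm);
        return tf * idf;
    }
}
```

A term that appears in every document gets idf = log(1) = 0, so it carries no weight, which is exactly the intent of the measure.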
5. Who uses Cascading?
Runs in many production environments
• Twitter, Soundcloud, Etsy, Airbnb, …
More APIs have been built on top
• Scalding (Scala) by Twitter
• Cascalog (Datalog)
• Lingual (SQL)
• Fluent (fluent Java API)
6. Cascading 3.0
Released in June 2015
A new planner
• Execution backend can be changed
Apache Tez executor
• Cascading programs are compiled to Tez jobs
• No identity mappers
• No writing to HDFS between jobs
8. Why Cascading on Flink?
Flink’s unique batch processing runtime
• Pipelined data exchange
• Actively managed memory on- & off-heap
• Efficient in-memory & out-of-core operators
• Sorting and hashing on binary data
• Robust operation without tuning (no OOMEs, no GC pressure)
YARN integration
9. Cascading on Flink Released
Available on Github
• Apache License V2
Depends on
• Cascading 3.1 WIP
• Flink 0.10-SNAPSHOT
• Will be pinned to the next releases of Cascading and Flink
Check Github for details:
http://github.com/dataartisans/cascading-flink
11. Flow Translation
Implemented on top of Java DataSet API
Using Cascading’s rule-based planner
• Flow is compiled into a single Flink job
• The operators of a job are partitioned into nodes
• Chaining of operators
Translation rules partition the flow if
• Data is shuffled
• Data is processed by Flink’s internal operators
• Flows branch or merge
• At sources and sinks
12. Operator Translation
Cascading operators have a fixed execution strategy
• No degrees of freedom for Flink’s optimizer
• Strategies are pinned via hints to Flink’s optimizer
Cascading Operator            Flink Operator(s)
(n-ary) GroupBy               (Union →) Reduce
(n-ary) BufferJoin CoGroup    (Union →) Reduce
(n-ary) CoGroup               (sequence of) binary hash-partitioned, sorted OuterJoins
(n-ary) HashJoin              (sequence of) binary broadcast HashJoins
(n-ary) Merge                 n-ary Union
Tap                           Source or Sink
13. Serializers & Comparators
Flink needs information about all processed data types
• Generation of serializer and comparators
Cascading supports
• Schema-less tuples (no fixed length, no declared field types)
• Definition of key fields by name and (relative) position
• Null-valued fields and key fields
Custom type information for Cascading tuples
• Native serializers & comparators for fields with known type
• Kryo for unknown field types
• Support for null values by wrapping serializers & comparators
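The null-wrapping idea can be sketched in plain Java (illustrative class, not Flink's actual serializer API): prefix each value with a presence flag and only delegate to the underlying serialization when the value is non-null.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class NullableStringSerializer {

    // Serialize a possibly-null String: one presence byte, then the value
    public static byte[] serialize(String value) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeBoolean(value != null);   // presence flag
            if (value != null) {
                out.writeUTF(value);           // delegate for non-null values
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static String deserialize(byte[] bytes) {
        try {
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(bytes));
            return in.readBoolean() ? in.readUTF() : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same flag trick applies to comparators: a null compares as smaller (or larger) than any present value, so sort order stays total.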
14. Going Out-of-Core
Join and CoGroup must hold data in memory
• If data exceeds memory, we need to go to disk
Cascading on MR uses spillable collections
• Spill to disk if #elements > threshold
• Part of Cascading (not MapReduce)
• A fixed threshold is often either too low or too high
Cascading on Flink uses Flink’s Join and OuterJoin
• Part of Flink (not Cascading)
• Backed by Flink’s managed memory
• Transparently spill to disk if necessary
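The count-based strategy Cascading-on-MR uses can be sketched in plain Java (illustrative names; Cascading's actual spillable collections serialize elements and write real files):

```java
import java.util.ArrayList;
import java.util.List;

// Buffer elements in memory; flush once a fixed element count is reached.
public class SpillableList<E> {

    private final int threshold;
    private final List<E> inMemory = new ArrayList<>();
    // Stands in for on-disk spill files in this sketch
    private final List<List<E>> spilledBatches = new ArrayList<>();

    public SpillableList(int threshold) {
        this.threshold = threshold;
    }

    public void add(E element) {
        inMemory.add(element);
        if (inMemory.size() >= threshold) {
            spilledBatches.add(new ArrayList<>(inMemory)); // "spill" to disk
            inMemory.clear();
        }
    }

    public int spillCount()    { return spilledBatches.size(); }
    public int inMemoryCount() { return inMemory.size(); }
}
```

The weakness is visible in the sketch: the threshold counts elements, not bytes, so a conservative value spills needlessly for small records while a generous one risks running out of heap for large ones. Flink's approach sizes buffers against its managed memory budget instead.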
16. How to Run Cascading on Flink
Add the cascading-flink Maven dependency to your
Cascading project
• Available in Sonatype Nexus Repository
• Or build it from source (Github)
Change just one line of code in your Cascading
program
• Replace Hadoop2MR1FlowConnector by FlinkConnector
• Do not change any application logic
Execute Cascading program as regular Flink program
Detailed instructions on Github
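The one-line change looks roughly like this (sketch; `FlinkConnector` is provided by the cascading-flink package, and `properties`, `source`, `sink`, and `pipe` are assumed to exist in your program):

```java
// Before: execute on Hadoop MapReduce
// FlowConnector connector = new Hadoop2MR1FlowConnector(properties);

// After: execute on Flink -- the only required change
FlowConnector connector = new FlinkConnector(properties);

Flow flow = connector.connect(source, sink, pipe); // application logic unchanged
flow.complete();
```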
18. Baseline Wordcount
[Bar chart: Wordcount execution time in minutes (axis 0–16) for MapReduce native, Flink native, Cascading on MR, and Cascading on Flink]
Cascading on MR compiled to 1 MR job
• Similar execution strategy (hash-partition, sort)
• No significant speed gain expected
Validates our implementation
Hash-Aggregators!
19. Something more complex: TF-IDF
Taken from “Cascading for the impatient”
• 2 CoGroup, 7 GroupBy, 1 HashJoin
http://docs.cascading.org/impatient
20. TF-IDF on MapReduce
Cascading on MapReduce translates the
TF-IDF program to 9 MapReduce jobs
Each job
• Reads data from HDFS
• Applies a Map function
• Shuffles the data over the network
• Sorts the data
• Applies a Reduce function
• Writes the data to HDFS
21. TF-IDF on Flink
Cascading on Flink translates the TF-IDF job into one Flink job
22. TF-IDF on Flink
Shuffle is pipelined
Intermediate results are not written to or read
from HDFS
[Bar chart: TF-IDF execution time in minutes (axis 0–600) for Cascading on MR vs. Cascading on Flink]
23. Conclusion
Executing Cascading jobs on Apache Flink
• Improves runtime
• Reduces parameter tuning and avoids failures
• Virtually no code changes
Apache Flink’s runtime is very versatile
• Apache Hadoop MR
• Apache Storm
• Google Dataflow
• Apache Samoa (incubating)
• + Flink’s own APIs and libraries…