Fabian Hueske – Cascading on Flink

Cascading on Flink
Fabian Hueske
@fhueske

What is Cascading?
“Cascading is the proven application development platform for building data applications on Hadoop.” (www.cascading.org)
▪ Java API for large-scale batch processing
▪ Programs are specified as data flows
• pipes, taps, flow, cascade, …
• each, groupBy, every, coGroup, merge, …
▪ Open source (Apache License 2.0)
• Developed by Concurrent

Cascading on MapReduce
▪ Originally for Hadoop MapReduce
▪ Much better API than MapReduce
• DAG programming model
• Higher-level operators (join, coGroup, merge)
• Composable and reusable code
▪ Automatic translation to MapReduce jobs
• Minimizes number of MapReduce jobs
▪ Rock-solid execution due to Hadoop MapReduce

Cascading Example
▪ Compute TF-IDF scores for a set of documents
• TF-IDF: Term Frequency / Inverse Document Frequency
• Used for weighting the relevance of terms in search engines
▪ Building this against the MapReduce API is painful!
Example taken from docs.cascading.org/impatient
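
To make the metric concrete, here is a minimal plain-Java sketch of TF-IDF on a toy in-memory corpus. It is not the Cascading flow from the tutorial: the class name TfIdfSketch, the method tfIdf, and the weighting tf × ln(N / df) are assumptions chosen for illustration, and other formulations add smoothing terms.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Cascading or Flink.
final class TfIdfSketch {

    // docs maps a document id to its list of tokens.
    // Returns, per document, a map from term to tf * ln(N / df).
    static Map<String, Map<String, Double>> tfIdf(Map<String, List<String>> docs) {
        Map<String, Map<String, Integer>> termCounts = new LinkedHashMap<>();
        Map<String, Integer> docFreq = new HashMap<>();

        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            // Term frequency: occurrences of each term within one document.
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String term : doc.getValue()) {
                counts.merge(term, 1, Integer::sum);
            }
            termCounts.put(doc.getKey(), counts);
            // Document frequency: number of documents containing the term.
            for (String term : counts.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        int n = docs.size();
        Map<String, Map<String, Double>> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Integer>> doc : termCounts.entrySet()) {
            Map<String, Double> perTerm = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> tc : doc.getValue().entrySet()) {
                double idf = Math.log((double) n / docFreq.get(tc.getKey()));
                perTerm.put(tc.getKey(), tc.getValue() * idf);
            }
            scores.put(doc.getKey(), perTerm);
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("doc1", Arrays.asList("flink", "cascading", "flink"));
        docs.put("doc2", Arrays.asList("flink", "hadoop"));
        System.out.println(tfIdf(docs));
    }
}
```

A term that occurs in every document gets a score of zero under this weighting, while terms unique to one document are weighted up. Even this toy version needs two aggregations and a combination step, which hints at why the distributed variant expands into many grouping and join operators.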

Who uses Cascading?
▪ Runs in many production environments
• Twitter, SoundCloud, Etsy, Airbnb, …
▪ Further APIs have been built on top
• Scalding (Scala) by Twitter
• Cascalog (Datalog)
• Lingual (SQL)
• Fluid (fluent Java API)

Cascading 3.0
▪ Released in June 2015
▪ A new planner
• Execution backend can be changed
▪ Apache Tez executor
• Cascading programs are compiled to Tez jobs
• No identity mappers
• No writing to HDFS between jobs

Why Cascading on Flink?
▪ Flink’s unique batch processing runtime
• Pipelined data exchange
• Actively managed memory on- & off-heap
• Efficient in-memory & out-of-core operators
• Sorting and hashing on binary data
• No tuning required for robust operation (OutOfMemoryError, GC)
▪ YARN integration

Cascading on Flink Released
▪ Available on GitHub
• Apache License V2
▪ Depends on
• Cascading 3.1 WIP
• Flink 0.10-SNAPSHOT
• Will be pinned to the next releases of Cascading and Flink
▪ Check GitHub for details:
http://github.com/dataartisans/cascading-flink

Flow Translation
▪ Implemented on top of Java DataSet API
▪ Using Cascading’s rule-based planner
• Flow is compiled into a single Flink job
• The operators of a job are partitioned into nodes
• Chaining of operators
▪ Translation rules partition the flow if
• Data is shuffled
• Data is processed by Flink’s internal operators
• Flows branch or merge
• At sources and sinks

Operator Translation
▪ Cascading operators have a fixed execution strategy
• No degree of freedom for Flink’s optimizer
• Strategies are fixed using hints for Flink’s optimizer

Cascading Operator            Flink Operator(s)
(n-ary) GroupBy               (Union -) Reduce
(n-ary) BufferJoin CoGroup    (Union -) Reduce
(n-ary) CoGroup               (Sequence of) binary hash-partitioned, sorted OuterJoin
(n-ary) HashJoin              (Sequence of) binary Broadcasted HashJoin
(n-ary) Merge                 n-ary Union
Tap                           Source or Sink

Serializers & Comparators
▪ Flink needs information about all processed data types
• Generation of serializers and comparators
▪ Cascading supports
• Schema-less tuples (no length, no types)
• Definition of key fields by name and (relative) position
• Null-valued fields and key fields
▪ Custom type information for Cascading tuples
• Native serializers & comparators for fields with known type
• Kryo for unknown field types
• Support for null values by wrapping serializers & comparators
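
The null-handling idea of wrapping a serializer can be sketched in plain Java as follows. This is an illustration under assumptions, not the connector's code and not Flink's serializer API: the interface FieldSerializer and the classes NullableSerializer and NullableSerializerSketch are hypothetical names, and the one-byte null flag is one possible encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical minimal serializer contract (not Flink's TypeSerializer).
interface FieldSerializer<T> {
    void write(T value, DataOutput out) throws IOException;
    T read(DataInput in) throws IOException;
}

// Wraps a serializer that cannot handle null by prefixing a null flag.
final class NullableSerializer<T> implements FieldSerializer<T> {
    private final FieldSerializer<T> inner;

    NullableSerializer(FieldSerializer<T> inner) {
        this.inner = inner;
    }

    @Override
    public void write(T value, DataOutput out) throws IOException {
        out.writeBoolean(value == null);   // flag: true means "field is null"
        if (value != null) {
            inner.write(value, out);       // delegate only for real values
        }
    }

    @Override
    public T read(DataInput in) throws IOException {
        return in.readBoolean() ? null : inner.read(in);
    }
}

final class NullableSerializerSketch {
    // A "native" serializer for a known field type; it would fail on null.
    static final FieldSerializer<String> STRING = new FieldSerializer<String>() {
        @Override
        public void write(String value, DataOutput out) throws IOException {
            out.writeUTF(value);
        }

        @Override
        public String read(DataInput in) throws IOException {
            return in.readUTF();
        }
    };

    // Serializes a value to bytes and reads it back.
    static <T> T roundTrip(FieldSerializer<T> serializer, T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            serializer.write(value, new DataOutputStream(bytes));
            return serializer.read(
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FieldSerializer<String> nullable = new NullableSerializer<>(STRING);
        System.out.println(roundTrip(nullable, "flink"));
        System.out.println(roundTrip(nullable, null));
    }
}
```

The wrapper keeps the type-specific serializer unchanged and costs one extra byte per field; a comparator could be wrapped in the same way by deciding the order of null against non-null before delegating.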

Going Out-of-Core
▪ Join and CoGroup must hold data in memory
• If data exceeds memory, we need to go to disk
▪ Cascading on MR uses spillable collections
• Spill to disk if #elements > threshold
• Part of Cascading (not MapReduce)
• Threshold either too low or too high
▪ Cascading on Flink uses Flink’s Join and OuterJoin
• Part of Flink (not Cascading)
• Backed by Flink’s managed memory
• Transparently spill to disk if necessary
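
The count-based spilling described for Cascading on MR can be sketched as below. This is a simplified illustration, not Cascading's actual spillable collection: the class SpillableListSketch, its methods, and the temp-file layout are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keeps up to `threshold` elements in memory,
// writes every further element to a temporary file.
final class SpillableListSketch implements Closeable {
    private final int threshold;
    private final List<String> inMemory = new ArrayList<>();
    private Path spillFile;
    private DataOutputStream spillOut;
    private int spilledCount;

    SpillableListSketch(int threshold) {
        this.threshold = threshold;
    }

    void add(String value) {
        if (inMemory.size() < threshold) {
            inMemory.add(value);
            return;
        }
        try {
            if (spillOut == null) {
                // First element over the threshold: open the spill file.
                spillFile = Files.createTempFile("spill", ".bin");
                spillOut = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(spillFile)));
            }
            spillOut.writeUTF(value);
            spilledCount++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int spilledCount() {
        return spilledCount;
    }

    // Returns all elements in insertion order: in-memory part, then spilled part.
    List<String> readAll() {
        List<String> all = new ArrayList<>(inMemory);
        if (spillOut == null) {
            return all;
        }
        try {
            spillOut.flush();
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(Files.newInputStream(spillFile)))) {
                for (int i = 0; i < spilledCount; i++) {
                    all.add(in.readUTF());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return all;
    }

    @Override
    public void close() {
        try {
            if (spillOut != null) {
                spillOut.close();
                Files.deleteIfExists(spillFile);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The threshold counts elements, not bytes, so the same setting can waste memory on small records or exhaust it on large ones. That is the "too low or too high" problem; operators backed by a managed memory budget decide when to spill based on the space actually used.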

How to Run Cascading on Flink
▪ Add the cascading-flink Maven dependency to your Cascading project
• Available in Sonatype Nexus Repository
• Or build it from source (GitHub)
▪ Change just one line of code in your Cascading program
• Replace Hadoop2MR1FlowConnector by FlinkConnector
• Do not change any application logic
▪ Execute Cascading program as regular Flink program
▪ Detailed instructions on GitHub

(Preliminary!) Performance Evaluation
▪ 8 worker nodes
• 8 CPUs, 30 GB RAM, 2 local SSDs each
▪ Hadoop 2.7.1 (YARN, HDFS, MapReduce)
▪ Flink 0.10-SNAPSHOT
▪ 80 GB of generated text data

Baseline Wordcount
[Figure: bar chart of execution time (min), axis from 0 to 16, for MapReduce native, Flink native, Cascading on MR, and Cascading on Flink]
▪ Cascading on MR compiled to 1 MR job
• Similar execution strategy (hash-partition, sort)
• No significant speed gain expected
▪ Verifies our implementation
▪ Hash-Aggregators!

Something more complex: TF-IDF
▪ Taken from “Cascading for the impatient”
• 2 CoGroup, 7 GroupBy, 1 HashJoin
http://docs.cascading.org/impatient

TF-IDF on MapReduce
▪ Cascading on MapReduce translates the TF-IDF program to 9 MapReduce jobs
▪ Each job
• Reads data from HDFS
• Applies a Map function
• Shuffles the data over the network
• Sorts the data
• Applies a Reduce function
• Writes the data to HDFS

TF-IDF on Flink
▪ Cascading on Flink translates the TF-IDF job into one Flink job

TF-IDF on Flink
▪ Shuffle is pipelined
▪ Intermediate results are not written to or read from HDFS
[Figure: bar chart of execution time (min), axis from 0 to 600, for Cascading on MR and Cascading on Flink]

Conclusion
▪ Executing Cascading jobs on Apache Flink
• Improves runtime
• Reduces parameter tuning and avoids failures
• Virtually no code changes
▪ Apache Flink’s runtime is very versatile; it also runs programs written for
• Apache Hadoop MR
• Apache Storm
• Google Dataflow
• Apache Samoa (incubating)
• + Flink’s own APIs and libraries…
