
Overview of Cascading 3.0 on Apache Flink

  1. Cascading on Flink
     Fabian Hueske (@fhueske)
  2. What is Cascading?
     “Cascading is the proven application development platform for building data applications on Hadoop.” (www.cascading.org)
     • Java API for large-scale batch processing
     • Programs are specified as data flows
       - pipes, taps, flow, cascade, …
       - each, groupBy, every, coGroup, merge, …
     • Originally for Hadoop MapReduce
       - Compiled to workflows of Hadoop MapReduce jobs
     • Open source (AL2)
       - Developed by Concurrent
  3. Why Cascading?
     • Vastly simplified API compared to the pure MapReduce API
       - Reuse of code, connecting flows, …
     • Automatic translation to MapReduce jobs
       - Minimizes the number of MapReduce jobs
     • Rock-solid execution due to Hadoop MapReduce
     • More APIs have been put on top
       - Scalding (Scala) by Twitter
       - Cascalog (Datalog)
       - Lingual (SQL)
       - Fluent (fluent Java API)
     • Runs in many production settings
       - Twitter, SoundCloud, Etsy, Airbnb, …
  4. Cascading Example
     • Compute TF-IDF scores for a set of documents
       - TF-IDF: term frequency × inverse document frequency
       - Used for weighting the relevance of terms in search engines
     • Building this against the MapReduce API is painful
     Example taken from docs.cascading.org/impatient
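The metric behind this example can be sketched in a few lines of plain Java. This is a minimal, self-contained illustration of the TF-IDF weighting itself (using the common tf × ln(N/df) form), not of the Cascading pipeline; the class name and the toy documents are made up for illustration.

```java
import java.util.List;

// Sketch of TF-IDF: tf(t, d) = occurrences of t in d / total terms in d,
// idf(t) = ln(total documents / documents containing t), score = tf * idf.
public class TfIdfSketch {

    public static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        long occurrences = doc.stream().filter(term::equals).count();
        double tf = (double) occurrences / doc.size();
        long docFreq = corpus.stream().filter(d -> d.contains(term)).count();
        double idf = Math.log((double) corpus.size() / docFreq);
        return tf * idf;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
            List.of("rain", "shadow", "rain"),
            List.of("rain", "valley"),
            List.of("desert", "valley"));
        // "rain" appears in 2 of 3 documents, "desert" in only 1 of 3,
        // so the rarer term "desert" receives the higher weight.
        System.out.println(tfIdf("rain", corpus.get(0), corpus));
        System.out.println(tfIdf("desert", corpus.get(2), corpus));
    }
}
```

In the slides, the same per-term counting and joining is expressed as a Cascading flow of pipes and taps, which is what gets compiled to MapReduce, Tez, or Flink jobs.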
  5. Cascading 3.0
     • Released in June 2015
     • A new planner
       - Execution backend can be changed
     • Apache Tez executor
       - Cascading programs are compiled to Tez jobs
       - No identity mappers
       - No writing to HDFS between jobs
  6. Why Cascading on Flink?
     • Flink’s unique batch processing runtime
       - Pipelined data exchange
       - Actively managed memory on- and off-heap
       - Efficient in-memory and out-of-core operators
       - Sorting and hashing on binary data
       - No tuning needed for robust operation (OOME, GC)
     • YARN integration
  7. Cascading on Flink released
     • Available on GitHub
       - Apache License V2
     • Depends on
       - Cascading 3.1 WIP
       - Flink 0.10-SNAPSHOT
       - Will be pinned to the next releases of Cascading and Flink
     • Check GitHub for details: http://github.com/dataartisans/cascading-flink
  8. Executing Cascading on Flink
     • Cascading programs are translated into Flink programs
     • Execution leverages all of Flink’s runtime features
       - Memory-safe execution
       - In-memory operators
       - Pipelining
       - Native serializers and binary comparators (if the program provides data types)
     • Use Flink’s regular execution clients
  9. Current limitations
     • HashJoin is only supported as an InnerJoin
       - HashJoin can be replaced by CoGroup
     • Support will be added once Flink supports hash-based outer joins
       - This is work in progress
  10. How to run Cascading on Flink
      • No binaries available yet
        - Clone the repository
        - Build it (mvn -DskipTests clean install)
      • Add the cascading-flink Maven dependency to your Cascading project
      • Change just one line of code in your Cascading program
        - Replace Hadoop2MR1FlowConnector with FlinkConnector
        - Do not change any application logic (except replacing HashJoin for non-InnerJoins)
      • Execute the Cascading program as a regular Flink program
      • Detailed instructions on GitHub
  11. Example: TF-IDF
      • Taken from “Cascading for the Impatient”
        - 2 CoGroup, 7 GroupBy, 1 HashJoin
      http://docs.cascading.org/impatient
  12. TF-IDF on MapReduce
      • Cascading on MapReduce translates the TF-IDF program into 9 MapReduce jobs
      • Each job
        - Reads data from HDFS
        - Applies a Map function
        - Shuffles the data over the network
        - Sorts the data
        - Applies a Reduce function
        - Writes the data to HDFS
  13. TF-IDF on Flink
      • Cascading on Flink translates the TF-IDF job into one Flink job
  14. TF-IDF on Flink
      • The shuffle is pipelined
      • Intermediate results are not written to or read from HDFS
  15. TF-IDF: MapReduce vs. Flink
      • 8 worker nodes
        - 8 CPUs, 30 GB RAM, 2 local SSDs
      • Hadoop 2.7.1 (YARN, HDFS, MapReduce)
      • Flink 0.10-SNAPSHOT
      • 80 GB of data (intermediate data is larger)
      • Cascading on Flink: 3:24 h
      • Cascading on MapReduce: 8:33 h
  16. Conclusion
      • Executing Cascading jobs on Apache Flink
        - Improves runtime
        - Reduces parameter tuning and avoids failures
        - Requires virtually no code changes
      • Apache Flink’s runtime is very versatile
        - Apache Hadoop MR
        - Apache Storm
        - Google Dataflow
        - Apache Samoa (incubating)
        - Plus Flink’s own APIs and libraries…
