Overview of Cascading 3.0 on Apache Flink

Overview of Cascading 3.0 and Apache Flink including how-to examples

Published in: Technology

  1. Cascading on Flink
     Fabian Hueske, @fhueske
  2. What is Cascading?
     “Cascading is the proven application development platform for building data applications on Hadoop.” (www.cascading.org)
     - Java API for large-scale batch processing
     - Programs are specified as data flows
       • pipes, taps, flow, cascade, …
       • each, groupBy, every, coGroup, merge, …
     - Originally for Hadoop MapReduce
       • Compiled to workflows of Hadoop MapReduce jobs
     - Open Source (AL2)
       • Developed by Concurrent
  3. Why Cascading?
     - Vastly simplified API compared to pure MR API
       • Reuse of code, connecting flows, …
     - Automatic translation to MR jobs
       • Minimizes number of MR jobs
     - Rock-solid execution due to Hadoop MapReduce
     - More APIs have been put on top
       • Scalding (Scala) by Twitter
       • Cascalog (Datalog)
       • Lingual (SQL)
       • Fluent (fluent Java API)
     - Runs in many production settings
       • Twitter, Soundcloud, Etsy, Airbnb, …
  4. Cascading Example
     - Compute TF-IDF scores for a set of documents
       • TF-IDF: Term Frequency / Inverse Document Frequency
       • Used for weighting the relevance of terms in search engines
     - Building this against the MapReduce API is painful
     Example taken from docs.cascading.org/impatient
  5. Cascading 3.0
     - Released in June 2015
     - A new planner
       • Execution backend can be changed
     - Apache Tez executor
       • Cascading programs are compiled to Tez jobs
       • No identity mappers
       • No writing to HDFS between jobs
  6. Why Cascading on Flink?
     - Flink’s unique batch processing runtime
       • Pipelined data exchange
       • Actively managed memory on- & off-heap
       • Efficient in-memory & out-of-core operators
       • Sorting and hashing on binary data
       • No tuning for robust operation (OOME, GC)
     - YARN integration
  7. Cascading on Flink released
     - Available on GitHub
       • Apache License V2
     - Depends on
       • Cascading 3.1 WIP
       • Flink 0.10-SNAPSHOT
       • Will be fixed to the next releases of Cascading and Flink
     - Check GitHub for details: http://github.com/dataartisans/cascading-flink
  8. Executing Cascading on Flink
     - Cascading programs are translated into Flink programs
     - Execution leverages all runtime features
       • Memory-safe execution
       • In-memory operators
       • Pipelining
       • Native serializers & binary comparators (if program provides data types)
     - Use Flink’s regular execution clients
  9. Current limitations
     - HashJoin only supported as InnerJoin
       • HashJoin can be replaced by CoGroup
     - Support will be added once Flink supports hash-based outer joins
       • This is work in progress
  10. How to run Cascading on Flink
      - No binaries available yet
        • Clone the repository
        • Build it (mvn -DskipTests clean install)
      - Add the cascading-flink Maven dependency to your Cascading project
      - Change just one line of code in your Cascading program
        • Replace Hadoop2MR1FlowConnector by FlinkConnector
        • Do not change any application logic (except replacing HashJoin with CoGroup for non-inner joins)
      - Execute the Cascading program as a regular Flink program
      - Detailed instructions on GitHub
  11. Example: TF-IDF
      - Taken from “Cascading for the impatient” (http://docs.cascading.org/impatient)
        • 2 CoGroup, 7 GroupBy, 1 HashJoin
  12. TF-IDF on MapReduce
      - Cascading on MapReduce translates the TF-IDF program to 9 MapReduce jobs
      - Each job
        • Reads data from HDFS
        • Applies a Map function
        • Shuffles the data over the network
        • Sorts the data
        • Applies a Reduce function
        • Writes the data to HDFS
  13. TF-IDF on Flink
      - Cascading on Flink translates the TF-IDF job into one Flink job
  14. TF-IDF on Flink
      - Shuffle is pipelined
      - Intermediate results are not written to or read from HDFS
  15. TF-IDF: MR vs. Flink
      - 8 worker nodes
        • 8 CPUs, 30GB RAM, 2 local SSDs
      - Hadoop 2.7.1 (YARN, HDFS, MapReduce)
      - Flink 0.10-SNAPSHOT
      - 80GB data (intermediate data larger)
      - Runtime
        • Cascading on Flink: 3:24h
        • Cascading on MapReduce: 8:33h
  16. Conclusion
      - Executing Cascading jobs on Apache Flink
        • Improves runtime
        • Reduces parameter tuning and avoids failures
        • Virtually no code changes
      - Apache Flink’s runtime is very versatile
        • Apache Hadoop MR
        • Apache Storm
        • Google Dataflow
        • Apache Samoa (incubating)
        • + Flink’s own APIs and libraries…
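
Slide 4 names TF-IDF as the running example without spelling out the computation. The following is a minimal plain-Java sketch of one common textbook variant, score = tf × ln(N / df), where tf is the raw count of a term in a document, df is the number of documents containing the term, and N is the number of documents. The class name `TfIdfSketch`, its method names, and the choice of this particular weighting are assumptions made for illustration; the "Cascading for the impatient" tutorial may use a different smoothing, and none of this uses the Cascading or Flink APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration only; not part of Cascading or Flink.
class TfIdfSketch {

    /** Counts how often each term occurs in one tokenized document (raw term frequency). */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Counts in how many documents each term occurs at least once (document frequency). */
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set ensures each term is counted once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Returns one term-to-score map per document, using tf * ln(N / df). */
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        Map<String, Integer> df = documentFrequencies(docs);
        int n = docs.size();
        List<Map<String, Double>> scores = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> docScores = new HashMap<>();
            for (Map.Entry<String, Integer> e : termFrequencies(doc).entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                docScores.put(e.getKey(), e.getValue() * idf);
            }
            scores.add(docScores);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Made-up toy corpus, already tokenized.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("flink", "runs", "cascading"),
            Arrays.asList("cascading", "runs", "on", "hadoop"),
            Arrays.asList("flink", "flink", "pipelines"));
        List<Map<String, Double>> scores = tfIdf(docs);
        for (int i = 0; i < scores.size(); i++) {
            System.out.println("doc " + i + ": " + scores.get(i));
        }
    }
}
```

Each of the three steps (per-document counts, per-term document counts, and combining the two) is a grouping or a join over the data, which is consistent with the tutorial's data flow needing several GroupBy and CoGroup operations (slide 11).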

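The two runtimes on slide 15 can be turned into a speedup factor with a little arithmetic. The sketch below assumes that "3:24h" and "8:33h" are written as hours:minutes; the class name `RuntimeComparison` and its methods are hypothetical helpers introduced only for this calculation.

```java
import java.util.Locale;

// Hypothetical helper for illustration; assumes durations are written as "H:MM".
class RuntimeComparison {

    /** Converts a duration written as hours:minutes into total minutes. */
    static int toMinutes(String hoursAndMinutes) {
        String[] parts = hoursAndMinutes.split(":");
        return Integer.parseInt(parts[0]) * 60 + Integer.parseInt(parts[1]);
    }

    /** Baseline runtime divided by candidate runtime; values above 1 mean the candidate is faster. */
    static double speedup(String baseline, String candidate) {
        return (double) toMinutes(baseline) / toMinutes(candidate);
    }

    public static void main(String[] args) {
        int mapReduce = toMinutes("8:33"); // Cascading on MapReduce (slide 15)
        int flink = toMinutes("3:24");     // Cascading on Flink (slide 15)
        System.out.printf(Locale.ROOT, "MapReduce: %d min, Flink: %d min%n", mapReduce, flink);
        System.out.printf(Locale.ROOT, "Speedup: %.2fx, time saved: %d min%n",
            speedup("8:33", "3:24"), mapReduce - flink);
    }
}
```

Under that reading, 8:33h is 513 minutes and 3:24h is 204 minutes, so the Flink run is roughly 2.5 times faster and saves 309 minutes on this workload.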