
Fabian Hueske – Cascading on Flink

Flink Forward 2015

Published in: Technology

  1. Cascading on Flink
      Fabian Hueske (@fhueske)
  2. What is Cascading?
      “Cascading is the proven application development platform for building data applications on Hadoop.” (www.cascading.org)
      • Java API for large-scale batch processing
      • Programs are specified as data flows
          • pipes, taps, flow, cascade, …
          • each, groupBy, every, coGroup, merge, …
      • Open Source (AL2)
          • Developed by Concurrent
  3. Cascading on MapReduce
      • Originally for Hadoop MapReduce
      • Much better API than MapReduce
          • DAG programming model
          • Higher-level operators (join, coGroup, merge)
          • Composable and reusable code
      • Automatic translation to MapReduce jobs
          • Minimizes number of MapReduce jobs
      • Rock-solid execution due to Hadoop MapReduce
  4. Cascading Example
      • Compute TF-IDF scores for a set of documents
          • TF-IDF: Term Frequency / Inverse Document Frequency
          • Used for weighting the relevance of terms in search engines
      • Building this against the MapReduce API is painful!
      Example taken from docs.cascading.org/impatient
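As a rough illustration of what such a job computes, the sketch below scores terms on a tiny in-memory corpus in plain Python. It is not Cascading code. It assumes one common variant of the formula, tf × ln(N / (1 + df)), where tf is the raw count of a term in a document, df is the number of documents containing the term, and N is the number of documents. The function name `tf_idf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {(doc_id, term): score} for a dict of doc_id -> text.

    Assumed formula (one common variant): tf * ln(N / (1 + df)).
    """
    n_docs = len(docs)
    # Term frequency: raw count of each term per document.
    tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    # Document frequency: number of documents that contain each term.
    df = Counter(term for counts in tf.values() for term in counts)
    return {
        (doc_id, term): count * math.log(n_docs / (1.0 + df[term]))
        for doc_id, counts in tf.items()
        for term, count in counts.items()
    }


# Hypothetical sample corpus.
docs = {
    "d1": "flink runs cascading flows",
    "d2": "cascading flows on hadoop",
    "d3": "hadoop mapreduce jobs",
    "d4": "tez runs on yarn",
}
scores = tf_idf(docs)
```

In this toy corpus a term that occurs in only one document, such as `flink`, scores higher than a term shared by two documents, such as `cascading`. Even this small computation needs a count per (document, term), a count per term, and a join of the two, which is why the distributed version takes several grouping and join steps.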
  5. Who uses Cascading?
      • Runs in many production environments
          • Twitter, Soundcloud, Etsy, Airbnb, …
      • More APIs have been put on top
          • Scalding (Scala) by Twitter
          • Cascalog (Datalog)
          • Lingual (SQL)
          • Fluent (fluent Java API)
  6. Cascading 3.0
      • Released in June 2015
      • A new planner
          • Execution backend can be changed
      • Apache Tez executor
          • Cascading programs are compiled to Tez jobs
          • No identity mappers
          • No writing to HDFS between jobs
  7. Cascading on Flink
  8. Why Cascading on Flink?
      • Flink’s unique batch processing runtime
          • Pipelined data exchange
          • Actively managed memory on- & off-heap
          • Efficient in-memory & out-of-core operators
          • Sorting and hashing on binary data
          • No tuning for robust operation (OOME, GC)
      • YARN integration
  9. Cascading on Flink Released
      • Available on Github
          • Apache License V2
      • Depends on
          • Cascading 3.1 WIP
          • Flink 0.10-SNAPSHOT
          • Will be pinned to the next releases of Cascading and Flink
      • Check Github for details: http://github.com/dataartisans/cascading-flink
  10. Translation Details
  11. Flow Translation
      • Implemented on top of Java DataSet API
      • Using Cascading’s rule-based planner
          • Flow is compiled into a single Flink job
          • The operators of a job are partitioned into nodes
          • Chaining of operators
      • Translation rules partition the flow if
          • Data is shuffled
          • Data is processed by Flink’s internal operators
          • Flows branch or merge
          • At sources and sinks
  12. Operator Translation
      • Cascading operators have fixed execution strategy
          • No degree of freedom for Flink’s optimizer
          • Strategies fixed using hints for Flink’s optimizer

      Cascading Operator            Flink Operator(s)
      (n-ary) GroupBy               (Union -) Reduce
      (n-ary) BufferJoin CoGroup    (Union -) Reduce
      (n-ary) CoGroup               (Sequence of) binary hash-partitioned, sorted OuterJoin
      (n-ary) HashJoin              (Sequence of) binary Broadcasted HashJoin
      (n-ary) Merge                 n-ary Union
      Tap                           Source or Sink
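The idea of running an n-ary HashJoin as a sequence of binary hash joins can be sketched in plain Python. This is only a model of the strategy, not the cascading-flink implementation: it assumes an inner join on a single key, keeps everything in local lists, and the names `binary_hash_join` and `nary_hash_join` are hypothetical. In each binary step the right-hand input plays the role of the broadcast build side and the left-hand input is streamed through as the probe side.

```python
from functools import reduce


def binary_hash_join(left, right, key):
    """Inner-join two lists of dicts on `key` using a hash table built on `right`."""
    table = {}
    for row in right:  # build side: the input that would be broadcast
        table.setdefault(row[key], []).append(row)
    joined = []
    for row in left:  # probe side: streamed through once
        for match in table.get(row[key], []):
            joined.append({**row, **match})
    return joined


def nary_hash_join(inputs, key):
    """Fold an n-ary join into a left-deep sequence of binary hash joins."""
    return reduce(lambda acc, nxt: binary_hash_join(acc, nxt, key), inputs)


# Hypothetical sample inputs.
users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 1, "item": "x"}, {"id": 1, "item": "y"}]
cities = [{"id": 1, "city": "Berlin"}]
result = nary_hash_join([users, orders, cities], "id")
```

Here `result` holds two rows for `id` 1, one per order, each extended with the city. The user with `id` 2 has no match and is dropped because the sketch uses inner-join semantics.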
  13. Serializers & Comparators
      • Flink needs information about all processed data types
          • Generation of serializer and comparators
      • Cascading supports
          • Schema-less tuples (no length, no types)
          • Definition of key fields by name and (relative) position
          • Null-valued fields and key fields
      • Custom type information for Cascading tuples
          • Native serializers & comparators for fields with known type
          • Kryo for unknown field types
          • Support for null values by wrapping serializers & comparators
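The wrapping approach for null values can be illustrated with a small Python sketch. It is not Flink's or cascading-flink's code. It assumes a one-byte null flag written in front of each value, and the classes `LongSerializer` and `NullableSerializer` are hypothetical stand-ins for a native serializer and its null-aware wrapper.

```python
import struct


class LongSerializer:
    """Serializes a non-null int as 8 big-endian bytes."""

    def serialize(self, value):
        return struct.pack(">q", value)

    def deserialize(self, data, offset=0):
        (value,) = struct.unpack_from(">q", data, offset)
        return value, offset + 8


class NullableSerializer:
    """Wraps another serializer and prefixes every value with a 1-byte null flag."""

    def __init__(self, inner):
        self.inner = inner

    def serialize(self, value):
        if value is None:
            return b"\x00"  # flag only, no payload
        return b"\x01" + self.inner.serialize(value)

    def deserialize(self, data, offset=0):
        if data[offset] == 0:
            return None, offset + 1
        return self.inner.deserialize(data, offset + 1)


ser = NullableSerializer(LongSerializer())
encoded = ser.serialize(42) + ser.serialize(None)
first, pos = ser.deserialize(encoded)
second, pos = ser.deserialize(encoded, pos)
```

The inner serializer never sees a null, so it can stay a plain fixed-width encoder. A comparator can be wrapped the same way by comparing the flag byte first and delegating only when both values are present.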
  14. Going Out-of-Core
      • Join and CoGroup must hold data in memory
          • If data exceeds memory, we need to go to disk
      • Cascading on MR uses spillable collections
          • Spill to disk if #elements > threshold
          • Part of Cascading (not MapReduce)
          • Threshold either too low or too high
      • Cascading on Flink uses Flink’s Join and OuterJoin
          • Part of Flink (not Cascading)
          • Backed by Flink’s managed memory
          • Transparently spill to disk if necessary
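A count-based spillable collection can be modeled in a few lines of Python. This is a simplified sketch, not Cascading's implementation: the class `SpillableList`, its fields, and the use of `pickle` with a temporary file are all assumptions made for illustration.

```python
import pickle
import tempfile


class SpillableList:
    """Keeps up to `threshold` elements in memory and spills the rest to a temp file."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.in_memory = []
        self.spill_file = None
        self.spilled = 0

    def add(self, item):
        if len(self.in_memory) < self.threshold:
            self.in_memory.append(item)
            return
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        self.spill_file.seek(0, 2)  # always append at the end of the file
        pickle.dump(item, self.spill_file)
        self.spilled += 1

    def __iter__(self):
        # In-memory elements first, then the spilled ones in insertion order.
        yield from self.in_memory
        if self.spill_file is not None:
            self.spill_file.seek(0)
            for _ in range(self.spilled):
                yield pickle.load(self.spill_file)


values = SpillableList(threshold=3)
for i in range(5):
    values.add(i)
```

The threshold counts elements, not bytes. A few very large records can exhaust the heap before the threshold is reached, while many tiny records spill to disk long before memory is tight, which is the tuning problem the slide points out. A runtime that tracks memory in bytes, as Flink's managed memory does, avoids having to pick such a number.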
  15. Running Cascading on Flink
  16. How to Run Cascading on Flink
      • Add the cascading-flink Maven dependency to your Cascading project
          • Available in Sonatype Nexus Repository
          • Or build it from source (Github)
      • Change just one line of code in your Cascading program
          • Replace Hadoop2MR1FlowConnector with FlinkConnector
          • Do not change any application logic
      • Execute Cascading program as regular Flink program
      • Detailed instructions on Github
  17. (Preliminary!) Performance Evaluation
      • 8 worker nodes
          • 8 CPUs, 30GB RAM, 2 local SSDs
      • Hadoop 2.7.1 (YARN, HDFS, MapReduce)
      • Flink 0.10-SNAPSHOT
      • 80GB generated text data
  18. Baseline Wordcount
      [Bar chart: execution time (min) for MapReduce native, Flink native, Cascading on MR, and Cascading on Flink]
      • Cascading on MR compiled to 1 MR job
          • Similar execution strategy (hash-partition, sort)
          • No significant speed gain expected
      • Verifies our implementation
      • Hash-Aggregators!
  19. Something more complex: TF-IDF
      • Taken from “Cascading for the impatient”
          • 2 CoGroup, 7 GroupBy, 1 HashJoin
      http://docs.cascading.org/impatient
  20. TF-IDF on MapReduce
      • Cascading on MapReduce translates the TF-IDF program into 9 MapReduce jobs
      • Each job
          • Reads data from HDFS
          • Applies a Map function
          • Shuffles the data over the network
          • Sorts the data
          • Applies a Reduce function
          • Writes the data to HDFS
  21. TF-IDF on Flink
      • Cascading on Flink translates the TF-IDF job into one Flink job
  22. TF-IDF on Flink
      • Shuffle is pipelined
      • Intermediate results are not written to or read from HDFS
      [Bar chart: execution time (min), Cascading on MR vs. Cascading on Flink]
  23. Conclusion
      • Executing Cascading jobs on Apache Flink
          • Improves runtime
          • Reduces parameter tuning and avoids failures
          • Virtually no code changes
      • Apache Flink’s runtime is very versatile
          • Apache Hadoop MR
          • Apache Storm
          • Google Dataflow
          • Apache Samoa (incubating)
          • + Flink’s own APIs and libraries…