SlideShare a Scribd company logo
Cascading on Flink
Fabian Hueske
@fhueske
What is Cascading?
“Cascading is the proven application
development platform for building data
applications on Hadoop.”
(www.cascading.org)
 Java API for large-scale batch processing
 Programs are specified as data flows
• pipes, taps, flow, cascade, …
• each, groupBy, every, coGroup, merge, …
 Originally for Hadoop MapReduce
• Compiled to workflows of Hadoop MapReduce jobs
 Open Source (AL2)
• Developed by Concurrent
2
Why Cascading?
 Vastly simplified API compared to pure MR API
• Reuse of code, connecting flows, …
 Automatic translation to MR jobs
• Minimizes number of MR jobs
 Rock-solid execution due to Hadoop MapReduce
 More APIs have been put on top
• Scalding (Scala) by Twitter
• Cascalog (Datalog)
• Lingual (SQL)
• Fluent (fluent Java API)
 Runs in many production settings
• Twitter, Soundcloud, Etsy, Airbnb, …
3
Cascading Example
4
 Compute TF-IDF scores for a set of documents
• TF-IDF: Term-Frequency / Inverted-Document-Frequency
• Used for weighting the relevance of terms in search engines
 Building this against the MapReduce API is painful
Example taken from docs.cascading.org/impatient
Cascading 3.0
 Released in June 2015
 A new planner
• Execution backend can be changed
 Apache Tez executor
• Cascading programs are compiled to Tez jobs
• No identity mappers
• No writing to HDFS between jobs
5
Why Cascading on Flink?
 Flink’s unique batch processing runtime
• Pipelined data exchange
• Actively managed memory on- & off-heap
• Efficient in-memory & out-of-core operators
• Sorting and hashing on binary data
• No tuning for robust operation (OOME, GC)
 YARN integration
6
Cascading on Flink released
 Available on Github
• Apache License V2
 Depends on
• Cascading 3.1 WIP
• Flink 0.10-SNAPSHOT
• Will be fixed to next releases of Cascading and Flink
 Check Github for details:
http://github.com/dataartisans/cascading-flink
7
Executing Cascading on Flink
 Cascading programs are translated into Flink
programs
 Execution leverages all runtime features
• Memory-safe execution
• In-memory operators
• Pipelining
• Native serializers & binary comparators
(if program provides data types)
 Use Flink’s regular execution clients
8
Current limitations
 HashJoin only supported as InnerJoin
• HashJoin can be replaced by CoGroup
 Support will be added once Flink supports
hash-based outer joins
• This is work in progress
9
How to run Cascading on Flink
 No binaries available yet 
• Clone the repository
• And build it (mvn –DskipTests clean install)
 Add the cascading-flink Maven dependency to your
Cascading project
 Change just one line of code in your Cascading program
• Replace Hadoop2MR1FlowConnector by FlinkConnector
• Do not change any application logic (except replacing HashJoin
for non-InnerJoins)
 Execute Cascading program as regular Flink program
 Detailed instructions on Github
10
Example: TF-IDF
 Taken from “Cascading for the impatient”
• 2 CoGroup, 7 GroupBy, 1 HashJoin
11http://docs.cascading.org/impatient
TF-IDF on MapReduce
 Cascading on MapReduce translates the
TF-IDF program to 9 MapReduce jobs
 Each job
• Reads data from HDFS
• Applies a Map function
• Shuffles the data over the network
• Sorts the data
• Applies a Reduce function
• Writes the data to HDFS
12
TF-IDF on Flink
 Cascading on Flink translates the TF-
IDF job into one Flink job
13
TF-IDF on Flink
 Shuffle is pipelined
 Intermediate results are not
written to or read from HDFS
14
TF-IDF: MR vs. Flink
 8 worker node
• 8 CPUs, 30GB RAM, 2 local SSDs
 Hadoop 2.7.1 (YARN, HDFS, MapReduce)
 Flink 0.10-SNAPSHOT
 80GB data (intermediate data larger)
15
Cascading on Flink -> 3:24h
Cascading on MapReduce -> 8:33h
Conclusion
 Executing Cascading jobs on Apache Flink
• Improves runtime
• Reduces parameter tuning and avoids failures
• Virtually no code changes
 Apache Flink’s runtime is very versatile
• Apache Hadoop MR
• Apache Storm
• Google Dataflow
• Apache Samoa (incubating)
• + Flink’s own APIs and libraries…
16

More Related Content

What's hot

Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksDataWorks Summit/Hadoop Summit
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopJim Dowling
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...confluent
 
Apache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft LibraryApache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft LibraryTsz-Wo (Nicholas) Sze
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedInGuozhang Wang
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Flink Forward
 
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraStream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraDatabricks
 
Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit
 
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit
 
HBaseConEast2016: Splice machine open source rdbms
HBaseConEast2016: Splice machine open source rdbmsHBaseConEast2016: Splice machine open source rdbms
HBaseConEast2016: Splice machine open source rdbmsMichael Stack
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusMonitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusDatabricks
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Juggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary dataJuggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary dataFabian Hueske
 
Avoiding big data antipatterns
Avoiding big data antipatternsAvoiding big data antipatterns
Avoiding big data antipatternsgrepalex
 
Spark optimization
Spark optimizationSpark optimization
Spark optimizationAnkit Beohar
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...viirya
 

What's hot (20)

Migrating pipelines into Docker
Migrating pipelines into DockerMigrating pipelines into Docker
Migrating pipelines into Docker
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
 
Data science lifecycle with Apache Zeppelin
Data science lifecycle with Apache ZeppelinData science lifecycle with Apache Zeppelin
Data science lifecycle with Apache Zeppelin
 
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Apache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft LibraryApache Ratis - In Search of a Usable Raft Library
Apache Ratis - In Search of a Usable Raft Library
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
 
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen ShapiraStream All Things—Patterns of Modern Data Integration with Gwen Shapira
Stream All Things—Patterns of Modern Data Integration with Gwen Shapira
 
Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub Hava
 
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar Veliqi
 
HBaseConEast2016: Splice machine open source rdbms
HBaseConEast2016: Splice machine open source rdbmsHBaseConEast2016: Splice machine open source rdbms
HBaseConEast2016: Splice machine open source rdbms
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusMonitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Juggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary dataJuggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary data
 
Avoiding big data antipatterns
Avoiding big data antipatternsAvoiding big data antipatterns
Avoiding big data antipatterns
 
Spark optimization
Spark optimizationSpark optimization
Spark optimization
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
 

Viewers also liked

Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Slim Baltagi
 
Fabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFlink Forward
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Kostas Tzoumas
 
Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Stephan Ewen
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingRobert Metzger
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLFlink Forward
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch ProcessingEdureka!
 
Building large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillBuilding large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillHenry Saputra
 
Harnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache TwillHarnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache TwillTerence Yim
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
 
Streaming using Kafka Flink & Elasticsearch
Streaming using Kafka Flink & ElasticsearchStreaming using Kafka Flink & Elasticsearch
Streaming using Kafka Flink & ElasticsearchKeira Zhou
 
Building Complex Data Workflows with Cascading on Hadoop
Building Complex Data Workflows with Cascading on HadoopBuilding Complex Data Workflows with Cascading on Hadoop
Building Complex Data Workflows with Cascading on HadoopGagan Agrawal
 
Unveiling FATA a Visual Journey.
Unveiling FATA a Visual Journey.Unveiling FATA a Visual Journey.
Unveiling FATA a Visual Journey.SRSP
 
Building Scale Free Applications with Hadoop and Cascading
Building Scale Free Applications with Hadoop and CascadingBuilding Scale Free Applications with Hadoop and Cascading
Building Scale Free Applications with Hadoop and Cascadingcwensel
 
Data flow in the data center
Data flow in the data centerData flow in the data center
Data flow in the data centerAdam Cataldo
 
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...Keith Kraus
 
QConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemQConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemDanny Yuan
 
Workshop Consul .- Service Discovery & Failure Detection
Workshop Consul .- Service Discovery & Failure DetectionWorkshop Consul .- Service Discovery & Failure Detection
Workshop Consul .- Service Discovery & Failure DetectionVincent Composieux
 

Viewers also liked (20)

Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink
 
Fabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on FlinkFabian Hueske – Cascading on Flink
Fabian Hueske – Cascading on Flink
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
 
Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer Checkpointing
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Building large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillBuilding large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twill
 
Harnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache TwillHarnessing the power of YARN with Apache Twill
Harnessing the power of YARN with Apache Twill
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Streaming using Kafka Flink & Elasticsearch
Streaming using Kafka Flink & ElasticsearchStreaming using Kafka Flink & Elasticsearch
Streaming using Kafka Flink & Elasticsearch
 
Building Complex Data Workflows with Cascading on Hadoop
Building Complex Data Workflows with Cascading on HadoopBuilding Complex Data Workflows with Cascading on Hadoop
Building Complex Data Workflows with Cascading on Hadoop
 
Unveiling FATA a Visual Journey.
Unveiling FATA a Visual Journey.Unveiling FATA a Visual Journey.
Unveiling FATA a Visual Journey.
 
Building Scale Free Applications with Hadoop and Cascading
Building Scale Free Applications with Hadoop and CascadingBuilding Scale Free Applications with Hadoop and Cascading
Building Scale Free Applications with Hadoop and Cascading
 
Data flow in the data center
Data flow in the data centerData flow in the data center
Data flow in the data center
 
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
Streaming Cyber Security into Graph: Accelerating Data into DataStax Graph an...
 
QConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemQConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing system
 
Workshop Consul .- Service Discovery & Failure Detection
Workshop Consul .- Service Discovery & Failure DetectionWorkshop Consul .- Service Discovery & Failure Detection
Workshop Consul .- Service Discovery & Failure Detection
 

Similar to Overview of Cascading 3.0 on Apache Flink

Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitDataWorks Summit
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Gyula Fóra
 
Spy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformSpy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformRedge Technologies
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaData Con LA
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing enginebigdatagurus_meetup
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Flink Forward
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversityAlex Zeltov
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with sparkHortonworks
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream DataDataWorks Summit
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentationpunesparkmeetup
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Data Con LA
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingJack Gudenkauf
 

Similar to Overview of Cascading 3.0 on Apache Flink (20)

Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Spy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platformSpy hard, challenges of 100G deep packet inspection on x86 platform
Spy hard, challenges of 100G deep packet inspection on x86 platform
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentation
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 

More from Cascading

Predicting Hospital Readmission Using Cascading
Predicting Hospital Readmission Using CascadingPredicting Hospital Readmission Using Cascading
Predicting Hospital Readmission Using CascadingCascading
 
Reducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop ApplicationsReducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop ApplicationsCascading
 
Cascading 2015 User Survey Results
Cascading 2015 User Survey ResultsCascading 2015 User Survey Results
Cascading 2015 User Survey ResultsCascading
 
Breathe new life into your data warehouse by offloading etl processes to hadoop
Breathe new life into your data warehouse by offloading etl processes to hadoopBreathe new life into your data warehouse by offloading etl processes to hadoop
Breathe new life into your data warehouse by offloading etl processes to hadoopCascading
 
How To Get Hadoop App Intelligence with Driven
How To Get Hadoop App Intelligence with DrivenHow To Get Hadoop App Intelligence with Driven
How To Get Hadoop App Intelligence with DrivenCascading
 
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...Cascading
 
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...Cascading
 
Elasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log ProcessingElasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log ProcessingCascading
 
Cascading concurrent yahoo lunch_nlearn
Cascading concurrent   yahoo lunch_nlearnCascading concurrent   yahoo lunch_nlearn
Cascading concurrent yahoo lunch_nlearnCascading
 
Introduction to Cascading
Introduction to Cascading  Introduction to Cascading
Introduction to Cascading Cascading
 
Accelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingAccelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingCascading
 

More from Cascading (11)

Predicting Hospital Readmission Using Cascading
Predicting Hospital Readmission Using CascadingPredicting Hospital Readmission Using Cascading
Predicting Hospital Readmission Using Cascading
 
Reducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop ApplicationsReducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop Applications
 
Cascading 2015 User Survey Results
Cascading 2015 User Survey ResultsCascading 2015 User Survey Results
Cascading 2015 User Survey Results
 
Breathe new life into your data warehouse by offloading etl processes to hadoop
Breathe new life into your data warehouse by offloading etl processes to hadoopBreathe new life into your data warehouse by offloading etl processes to hadoop
Breathe new life into your data warehouse by offloading etl processes to hadoop
 
How To Get Hadoop App Intelligence with Driven
How To Get Hadoop App Intelligence with DrivenHow To Get Hadoop App Intelligence with Driven
How To Get Hadoop App Intelligence with Driven
 
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an...
 
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
The Cascading (big) data application framework - André Keple, Sr. Engineer, C...
 
Elasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log ProcessingElasticsearch + Cascading for Scalable Log Processing
Elasticsearch + Cascading for Scalable Log Processing
 
Cascading concurrent yahoo lunch_nlearn
Cascading concurrent   yahoo lunch_nlearnCascading concurrent   yahoo lunch_nlearn
Cascading concurrent yahoo lunch_nlearn
 
Introduction to Cascading
Introduction to Cascading  Introduction to Cascading
Introduction to Cascading
 
Accelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingAccelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with Cascading
 

Recently uploaded

Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Product School
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Julian Hyde
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsExpeed Software
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyJohn Staveley
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCzechDreamin
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...CzechDreamin
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀DianaGray10
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...Sri Ambati
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutesconfluent
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backElena Simperl
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Thierry Lestable
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Product School
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsPaul Groth
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomCzechDreamin
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...Product School
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...CzechDreamin
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
 

Recently uploaded (20)

Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 

Overview of Cascading 3.0 on Apache Flink

  • 1. Cascading on Flink Fabian Hueske @fhueske
  • 2. What is Cascading? “Cascading is the proven application development platform for building data applications on Hadoop.” (www.cascading.org)  Java API for large-scale batch processing  Programs are specified as data flows • pipes, taps, flow, cascade, … • each, groupBy, every, coGroup, merge, …  Originally for Hadoop MapReduce • Compiled to workflows of Hadoop MapReduce jobs  Open Source (AL2) • Developed by Concurrent 2
  • 3. Why Cascading?  Vastly simplified API compared to pure MR API • Reuse of code, connecting flows, …  Automatic translation to MR jobs • Minimizes number of MR jobs  Rock-solid execution due to Hadoop MapReduce  More APIs have been put on top • Scalding (Scala) by Twitter • Cascalog (Datalog) • Lingual (SQL) • Fluent (fluent Java API)  Runs in many production settings • Twitter, Soundcloud, Etsy, Airbnb, … 3
  • 4. Cascading Example 4  Compute TF-IDF scores for a set of documents • TF-IDF: Term-Frequency / Inverted-Document-Frequency • Used for weighting the relevance of terms in search engines  Building this against the MapReduce API is painful Example taken from docs.cascading.org/impatient
  • 5. Cascading 3.0  Released in June 2015  A new planner • Execution backend can be changed  Apache Tez executor • Cascading programs are compiled to Tez jobs • No identity mappers • No writing to HDFS between jobs 5
  • 6. Why Cascading on Flink?  Flink’s unique batch processing runtime • Pipelined data exchange • Actively managed memory on- & off-heap • Efficient in-memory & out-of-core operators • Sorting and hashing on binary data • No tuning for robust operation (OOME, GC)  YARN integration 6
  • 7. Cascading on Flink released  Available on Github • Apache License V2  Depends on • Cascading 3.1 WIP • Flink 0.10-SNAPSHOT • Will be fixed to next releases of Cascading and Flink  Check Github for details: http://github.com/dataartisans/cascading-flink 7
  • 8. Executing Cascading on Flink  Cascading programs are translated into Flink programs  Execution leverages all runtime features • Memory-safe execution • In-memory operators • Pipelining • Native serializers & binary comparators (if program provides data types)  Use Flink’s regular execution clients 8
  • 9. Current limitations  HashJoin only supported as InnerJoin • HashJoin can be replaced by CoGroup  Support will be added once Flink supports hash-based outer joins • This is work in progress 9
  • 10. How to run Cascading on Flink  No binaries available yet  • Clone the repository • And build it (mvn –DskipTests clean install)  Add the cascading-flink Maven dependency to your Cascading project  Change just one line of code in your Cascading program • Replace Hadoop2MR1FlowConnector by FlinkConnector • Do not change any application logic (except replacing HashJoin for non-InnerJoins)  Execute Cascading program as regular Flink program  Detailed instructions on Github 10
  • 11. Example: TF-IDF  Taken from “Cascading for the impatient” • 2 CoGroup, 7 GroupBy, 1 HashJoin 11http://docs.cascading.org/impatient
  • 12. TF-IDF on MapReduce  Cascading on MapReduce translates the TF-IDF program to 9 MapReduce jobs  Each job • Reads data from HDFS • Applies a Map function • Shuffles the data over the network • Sorts the data • Applies a Reduce function • Writes the data to HDFS 12
  • 13. TF-IDF on Flink  Cascading on Flink translates the TF- IDF job into one Flink job 13
  • 14. TF-IDF on Flink  Shuffle is pipelined  Intermediate results are not written to or read from HDFS 14
  • 15. TF-IDF: MR vs. Flink  8 worker node • 8 CPUs, 30GB RAM, 2 local SSDs  Hadoop 2.7.1 (YARN, HDFS, MapReduce)  Flink 0.10-SNAPSHOT  80GB data (intermediate data larger) 15 Cascading on Flink -> 3:24h Cascading on MapReduce -> 8:33h
  • 16. Conclusion  Executing Cascading jobs on Apache Flink • Improves runtime • Reduces parameter tuning and avoids failures • Virtually no code changes  Apache Flink’s runtime is very versatile • Apache Hadoop MR • Apache Storm • Google Dataflow • Apache Samoa (incubating) • + Flink’s own APIs and libraries… 16