Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Flink and Beam:
Current State & Roadmap
Maximilian Michels
PMC Apache Flink and Apache Beam
FlinkForward 2016
Apache Flink...
2
Google Cloud Dataflow On Top of Apache Flink®
Last year’s talk:
This year’s talk:
Apache Flink® and Apache Beam: Current...
Two projects, one idea
• Dec 2014 Google releases the Cloud Dataflow Java SDK
• Feb 2016 Cloud Dataflow Java SDK becomes Apa...
Apache Beam
(incubating)
• Jan 2016 Google proposes project to the Apache incubator
• Feb 2016 Project enters incubation
•...
The Dataflow/Beam Model
• 2004 MapReduce
• 2010 Flume Java (Beam’s API)
• 2013 MillWheel (Watermarks, exactly-once,
Windows...
Beam Runners
• A Runner ports Beam code to a backend
• A Runner’s concern is
• a) translation
• b) runtime execution
• Run...
Flink Runner - Status Quo
• Stable: 0.2.0-incubating
• Powered by Flink 1.0.3
• Development: 0.3.0-incubating-SNAPSHOT
• P...
Evolution of the Flink Runner
2014
Dec
• Google Cloud Dataflow SDK released
2015
Jan
• Initial commit and drafting
Mar
• Ba...
Batched vs Streamed
Execution
• Batched Execution
• Built on top of the DataSet API
• Supports only bounded sources
• Mana...
The Flink Runner’s Strength
• Backed by a runtime which is streaming and batch aware
• Embeds and reuses Beam logic whenev...
Capability Matrix
• Compares Runners with the reference model
• What results are being calculated?
• Where in event time?
...
What
12
Where
13
When
14
How
15
So we’re good?
• Flink Runner currently the most advanced Runner
which is backed by an open-source engine
• Does that mean...
Roadmap Flink Runner
• Side Input streaming: size restrictions
• Integrate with new PipelineResult
• Dynamic Scaling
• Inc...
Apache Flink Features
powering the Runner
• Streaming
• Checkpointing / Savepoints
• State backends
• Batch
• Pipelined ex...
Demo
• Develop a Beam program with the Flink Runner
• Monitor applications using the web interface
• Handle failures by re...
Thank you for your
attention!
Upcoming SlideShare
Loading in …5
×

Maximilian Michels - Flink and Beam

1,307 views

Published on

http://flink-forward.org/kb_sessions/flink-and-beam-current-state-roadmap/

It is no secret that the Dataflow model, which evolved from Google’s MapReduce, Flume, and MillWheel, has been a major influence to Apache Flink’s streaming API. The essentials of this model are captured in Apache Beam. Beam provides the Dataflow API with the option to deploy to various backends (e.g. Flink, Spark). In this talk we will examine the current state of the Flink Runner. Beam’s Runners manage the translation of the Beam API into the backend API. The Beam project itself has made an effort to summarize the capabilities of each Runner to provide an overview of the supported API concepts. From all open sources backends, Flink is currently the Runner which supports the most features. We will look at the supported Beam features and their counterpart in Flink. Further, we will look at potential improvements and upcoming features of the Flink Runner.

Published in: Data & Analytics
  • Be the first to comment

Maximilian Michels - Flink and Beam

  1. 1. Flink and Beam: Current State & Roadmap Maximilian Michels PMC Apache Flink and Apache Beam FlinkForward 2016 Apache Flink® Apache Beam (incubating)
  2. 2. 2 Google Cloud Dataflow On Top of Apache Flink® Last year’s talk: This year’s talk: Apache Flink® and Apache Beam: Current State & Roadmap What happened?
  3. 3. Two projects, one idea • Dec 2014 Google releases the Cloud Dataflow Java SDK • Feb 2016 Cloud Dataflow Java SDK becomes Apache Beam 3 MapReduce BigTable DremelColossus FlumeMegastoreSpanner PubSub Millwheel Apache Beam Google Cloud Dataflow Taken from the official Beam material
  4. 4. Apache Beam (incubating) • Jan 2016 Google proposes project to the Apache incubator • Feb 2016 Project enters incubation • Jun 2016 Apache Beam 0.1.0-incubating released • Jul 2016 Apache Beam 0.2.0-incubating released 4 Dataflow Java 1.x Apache Beam Java 0.x Apache Beam Java 2.x Bug Fix Feature Breaking Change
  5. 5. The Dataflow/Beam Model • 2004 MapReduce • 2010 Flume Java (Beam’s API) • 2013 MillWheel (Watermarks, exactly-once, Windows) • Sep 2015 Dataflow Model published at VLDB • Nov 2015 Apache Flink 0.10.0 DataStream API features Event Time in line with the Dataflow model 5
  6. 6. Beam Runners • A Runner ports Beam code to a backend • A Runner’s concern is • a) translation • b) runtime execution • Runners available: • Google Cloud Dataflow • Apache Flink • Apache Spark • In Development: Gearpump, Apache Apex 6 Beam Model: Fn Runners Apache Flink Apache Spark Beam Model: Pipeline Construction Other LanguagesBeam Java Beam Python Execution Execution Cloud Dataflow Execution Taken from the official Beam material
  7. 7. Flink Runner - Status Quo • Stable: 0.2.0-incubating • Powered by Flink 1.0.3 • Development: 0.3.0-incubating-SNAPSHOT • Powered by Flink 1.1.2 • Changes coming in regularly, use the snapshot version to get the latest features 7
  8. 8. Evolution of the Flink Runner 2014 Dec • Google Cloud Dataflow SDK released 2015 Jan • Initial commit and drafting Mar • Batch support without Windows Dec • Streaming support with Watermarks and Windows 8 2016 Mar • Contribution to Apache Beam May • Parallel and checkpointed sources in streaming • Batch support with Windows and Side Inputs • RunnableOnService for batch Aug • Side Inputs for streaming • RunnableOnService for streaming
  9. 9. Batched vs Streamed Execution • Batched Execution • Built on top of the DataSet API • Supports only bounded sources • Managed Memory • Streamed Execution • Built on top of the DataStream API • Supports bounded and unbounded sources • Watermark support which leverages full Dataflow Model • Checkpointing • When to pick which? If undecided, just let the Runner decide 9
  10. 10. The Flink Runner’s Strength • Backed by a runtime which is streaming and batch aware • Embeds and reuses Beam logic whenever possible • Window and Triggering • Source API • State Internals / Checkpointing • Whenever possible, translate primitive transforms • Integration with Beam’s RunnableOnService tests (batch/ streaming) 10
  11. 11. Capability Matrix • Compares Runners with the reference model • What results are being calculated? • Where in event time? • When in processing time? • How do refinements of results relate? 11 For more information see http://beam.incubator.apache.org/learn/runners/capability-matrix/
  12. 12. What 12
  13. 13. Where 13
  14. 14. When 14
  15. 15. How 15
  16. 16. So we’re good? • Flink Runner currently the most advanced Runner which is backed by an open-source engine • Does that mean the Flink Runner is perfect? 16
  17. 17. Roadmap Flink Runner • Side Input streaming: size restrictions • Integrate with new PipelineResult • Dynamic Scaling • Incremental Checkpointing • Connectors (Kafka 8/9, Cassandra, Elasticsearch, RabbitMQ, Redis, NiFi, Kinesis) • Libraries (Gelly, CEP, Storm Compatibility, ML) • Performance testing • Bug fixes 17
  18. 18. Apache Flink Features powering the Runner • Streaming • Checkpointing / Savepoints • State backends • Batch • Pipelined execution • Managed Memory • High Availability • Flink’s native sources / sinks 18
  19. 19. Demo • Develop a Beam program with the Flink Runner • Monitor applications using the web interface • Handle failures by restoring from checkpointed state • Create a Savepoint, stop the Beam program, resume execution from it 19
  20. 20. Thank you for your attention!

×