
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Processing in Your Favourite Language with Beam on Flink



Flink is a great stream processor, Python is a great programming language, and Apache Beam is a great programming model and portability layer. Using all three together is a great idea! We will demo and discuss writing Beam Python pipelines and running them on Flink. We will cover Beam's portability vision that led here, what you need to know about how Beam Python pipelines are executed on Flink, and where Beam's portability framework is headed next (hint: Python pipelines reading from non-Python connectors).

Published in: Data & Analytics

  1. Talk Python to Me: Stream Processing in Your Favourite Language with Beam on Flink. Slides by Aljoscha Krettek, September 2017, Flink Forward 2017. Based on work and slides by Frances Perry, Tyler Akidau, Kenneth Knowles & Sourabh Bajaj
  2. Agenda 1. What is Beam? 2. The Beam Portability APIs (Fn / Pipeline) 3. Executing Pythonic Beam Jobs on Flink 4. The Future
  3. What is Beam?
  4. The Evolution of Apache Beam: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Apache Beam, Google Cloud Dataflow
  5. Beam Model: Generations Beyond MapReduce. Improved abstractions let you focus on your application logic. Batch and stream processing are both first-class citizens, with no need to choose. Clearly separates event time from processing time.
  6. The Apache Beam Vision 1. End users: who want to write pipelines in a language that’s familiar. 2. SDK writers: who want to make Beam concepts available in new languages. 3. Runner writers: who have a distributed processing environment and want to support Beam pipelines. [Diagram: Beam Model: Pipeline Construction (Beam Java, Beam Python, Other Languages); Beam Model: Fn Runners (Apache Flink, Apache Spark, Cloud Dataflow); Execution]
  7. The Beam Model (Flink draws it more like this)
  8. The Beam Model: Pipeline, PTransform, PCollection (bounded or unbounded)
  9. Beam Model: Asking the Right Questions. What results are calculated? Where in event time are results calculated? When in processing time are results materialized? How do refinements of results relate?
  10. The Beam Model: What is Being Computed? Java: PCollection<KV<String, Integer>> scores = input .apply(Sum.integersPerKey()); Python: scores = (input | Sum.integersPerKey())
  11. The Beam Model: What is Being Computed?
  12. The Beam Model: Where in Event Time? Java: PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))) .apply(Sum.integersPerKey()); Python: scores = (input | beam.WindowInto(FixedWindows(2 * 60)) | Sum.integersPerKey())
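To make the "where in event time" step concrete, here is a minimal plain-Python sketch of what a two-minute fixed window followed by a per-key sum computes. It does not use the Beam API; the function names and the sample events are made up for illustration, and event times are assumed to be seconds.

```python
from collections import defaultdict

WINDOW_SIZE = 2 * 60  # seconds, mirroring FixedWindows(2 * 60) on the slide


def window_start(event_time, size=WINDOW_SIZE):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % size)


def sum_per_key_and_window(events, size=WINDOW_SIZE):
    """Sum values per (key, window start).

    events is an iterable of (key, value, event_time_seconds) tuples.
    """
    totals = defaultdict(int)
    for key, value, event_time in events:
        totals[(key, window_start(event_time, size))] += value
    return dict(totals)


# Hypothetical sample input: (user, score, event time in seconds).
events = [("alice", 5, 10), ("alice", 7, 110), ("alice", 3, 130), ("bob", 4, 125)]
result = sum_per_key_and_window(events)
```

Each element lands in exactly one window, decided by its event time rather than by when it arrives, which is the property the slide is illustrating.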
  13. The Beam Model: Where in Event Time?
  14. The Beam Model: When in Processing Time? Java: PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))) .triggering(AtWatermark())) .apply(Sum.integersPerKey()); Python: scores = (input | beam.WindowInto(FixedWindows(2 * 60) .triggering(AtWatermark())) | Sum.integersPerKey())
  15. The Beam Model: When in Processing Time?
  16. The Beam Model: How Do Refinements Relate? Java: PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Duration.standardMinutes(1))) .withLateFirings(AtCount(1))) .accumulatingFiredPanes()) .apply(Sum.integersPerKey()); Python: scores = (input | beam.WindowInto(FixedWindows(2 * 60) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(1 * 60)) .withLateFirings(AtCount(1))) .accumulatingFiredPanes()) | Sum.integersPerKey())
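The effect of accumulating fired panes can be sketched in plain Python. This is an illustration of the idea only, not Beam code: `fire_panes` and the sample batches are hypothetical, and the discarding mode is included for contrast even though the slide only shows the accumulating one.

```python
def fire_panes(batches, accumulating=True):
    """Return the value emitted at each trigger firing for a single window.

    batches holds the values that arrived between consecutive firings.
    In accumulating mode each pane is the running total so far; in
    discarding mode each pane only covers the values since the last firing.
    """
    panes = []
    running = 0
    for batch in batches:
        delta = sum(batch)
        running += delta
        panes.append(running if accumulating else delta)
    return panes


# Hypothetical firings: one early, one at the watermark, one late.
batches = [[3, 4], [5], [1]]
accumulated = fire_panes(batches, accumulating=True)
discarded = fire_panes(batches, accumulating=False)
```

With accumulation, a consumer can simply overwrite the previous result with the latest pane; without it, the consumer has to add the panes up itself.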
  17. The Beam Model: How Do Refinements Relate?
  18. Customizing What Where When How: 1 Classic Batch, 2 Windowed Batch, 3 Streaming, 4 Streaming + Accumulation. For more information see
  19. A Complete Example of Pythonic Beam Code: import apache_beam as beam with beam.Pipeline() as p: (p |"twitter_topic") | beam.WindowInto(SlidingWindows(5*60, 1*60)) | beam.ParDo(ParseHashTagDoFn()) | beam.combiners.Count.PerElement() | beam.ParDo(BigQueryOutputFormatDoFn()) |"trends_table"))
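The windowing and counting part of this pipeline can be approximated without Beam. The sketch below assumes event times in seconds and uses made-up hashtag events; `sliding_windows` and `count_per_window` are illustrative helpers, not Beam functions. It shows how a five-minute window sliding every minute places each element into five overlapping windows before counting.

```python
from collections import Counter


def sliding_windows(event_time, size=5 * 60, period=1 * 60):
    """Return start times of every window [start, start + size) containing event_time."""
    start = event_time - (event_time % period)  # latest window that contains the element
    starts = []
    while start > event_time - size:
        starts.append(start)
        start -= period
    return starts


def count_per_window(tagged_events):
    """Count occurrences of each tag in each sliding window.

    tagged_events is an iterable of (tag, event_time_seconds) tuples.
    """
    counts = Counter()
    for tag, event_time in tagged_events:
        for start in sliding_windows(event_time):
            counts[(start, tag)] += 1
    return counts


# Hypothetical input, standing in for parsed hashtags from a stream.
counts = count_per_window([("#flink", 130), ("#beam", 130), ("#flink", 200)])
```

Window start times can be negative for elements near time zero; that only reflects where the window grid sits, not an error.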
  20. What is Apache Beam? 1. The Beam Model: What / Where / When / How 2. SDKs for writing Beam pipelines 3. Runners for Existing Distributed Processing Backends ○ Apache Apex ○ Apache Flink ○ Apache Spark ○ Google Cloud Dataflow ○ Local (in-process) runner for testing
  21. Beam Portability APIs (Pipeline / Job / Fn)
  22. What are we trying to solve? ● Executing user code written in an arbitrary language (Python) on a Runner written in a different language (Java) ● Mixing user functions written in different languages (Connectors, Sources, Sinks, …)
  23. Terminology. Beam Model: Describes the API concepts and the possible operations on PCollections. Pipeline: User-defined graph of transformations on PCollections. This is constructed using a Beam SDK. The transformations can contain UDFs. Runner: Executes a Pipeline. For example: FlinkRunner. Beam SDK: Language-specific library/framework for creating programs that use the Beam Model. Allows defining Pipelines and UDFs and provides APIs for executing them. User-defined function (UDF): Code in Java, Python, … that specifies how data is transformed. For example DoFn or CombineFn.
  24. Executing a Beam Pipeline - The Big Picture: SDK User Pipeline Pipeline API Runner Worker Fn API Job API
  25. APIs for Different Pipeline Lifecycle Stages. Pipeline API ● Used by the SDK to construct SDK-agnostic Pipeline representation ● Used by the Runner to translate a Pipeline to runner-specific operations Fn API ● Used by an SDK harness for communication with a Runner ● Used by the Runner to push work into an SDK harness Job API ● (API for interacting with a running Pipeline)
  26. Pipeline API (simplified) ● Definition of common primitive transformations (Read, ParDo, Flatten, Window.into, GroupByKey) ● Definition of serialized Pipeline (protobuf) Pipeline = {PCollection*, PTransform*, WindowingStrategy*, Coder*} PTransform = {Inputs*, Outputs*, FunctionSpec} FunctionSpec = {URN, payload}
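The simplified Pipeline structure on this slide can be mirrored with plain Python dataclasses. This is only a sketch of the shape, not the actual protobuf definitions: WindowingStrategy and Coder are left out for brevity, and the URNs and transform names below are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FunctionSpec:
    urn: str              # identifies what kind of function this is
    payload: bytes = b""  # opaque, function-specific data (e.g. serialized user code)


@dataclass
class PTransform:
    inputs: List[str]     # names of input PCollections
    outputs: List[str]    # names of output PCollections
    spec: FunctionSpec


@dataclass
class Pipeline:
    pcollections: List[str] = field(default_factory=list)
    transforms: Dict[str, PTransform] = field(default_factory=dict)

    def add(self, name, transform):
        """Register a transform, checking that its inputs already exist."""
        for pcollection in transform.inputs:
            if pcollection not in self.pcollections:
                raise ValueError(f"unknown input PCollection: {pcollection}")
        self.transforms[name] = transform
        self.pcollections.extend(transform.outputs)


# Hypothetical two-step pipeline; the URNs are placeholders, not real Beam URNs.
pipeline = Pipeline()
pipeline.add("read", PTransform([], ["lines"], FunctionSpec("example:read:v1")))
pipeline.add("parse", PTransform(["lines"], ["tags"], FunctionSpec("example:pardo:v1", b"user-code")))
```

Because a transform is described only by a URN plus an opaque payload, a runner can recognise the primitives it understands and treat everything else as user code to hand to an SDK harness.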
  27. Job API: public interface JobApi { State getState(); // RUNNING, DONE, CANCELED, FAILED ... State cancel() throws IOException; State waitUntilFinish(Duration duration); State waitUntilFinish(); MetricResults metrics(); }
  28. Fn API ● gRPC interface definitions for communication between an SDK harness and a Runner ● Control: Used to tell the SDK which UDFs to execute and when to execute them. ● Data: Used to move data between the language-specific SDK harness and the runner. ● State: Used to support user state, side inputs, and group by key reiteration. ● Logging: Used to aggregate logging information from the language-specific SDK harness.
  29. Fn API (continued)
  30. Fn API - Bundle Processing
  31. Fn API - Processing DoFns: Say we need to execute this part
  32. Fn API - Processing DoFns: Python DoFn, Python DoFn
  33. Fn API - Processing DoFns (Pipeline manipulation): Python DoFn, Python DoFn, gRPC Source, gRPC Sink. The Runner inserts these
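The pipeline manipulation on this slide can be pictured with a small plain-Python sketch. It assumes a linear chain of stages tagged with the language they are written in; `insert_grpc_boundaries` and the stage names are hypothetical and stand in for what a runner would do on the real pipeline graph.

```python
def insert_grpc_boundaries(stages, sdk_language="python"):
    """Wrap each run of SDK-language stages with gRPC Source / gRPC Sink markers.

    stages is a list of (name, language) tuples in execution order.
    Returns the list of stage names with the boundary markers inserted.
    """
    result = []
    in_sdk = False
    for name, language in stages:
        is_sdk = language == sdk_language
        if is_sdk and not in_sdk:
            result.append("gRPC Source")  # data enters the SDK harness here
        if not is_sdk and in_sdk:
            result.append("gRPC Sink")    # data returns to the runner here
        result.append(name)
        in_sdk = is_sdk
    if in_sdk:
        result.append("gRPC Sink")        # close a trailing SDK run
    return result


# Hypothetical pipeline: Java connectors around two Python DoFns.
stages = [("Read", "java"), ("ParseDoFn", "python"), ("FormatDoFn", "python"), ("Write", "java")]
rewritten = insert_grpc_boundaries(stages)
```

Consecutive Python DoFns share one source and one sink, so data crosses the process boundary once per run of user functions rather than once per function.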
  34. Fn API - Executing the user Fn using an SDK Harness ● We can execute as a separate process ● We can execute in a Docker container Worker Fn API ● Repository of containers for different SDKs ● We inject the user code into the container when starting ● Container is user-configurable
  35. Executing Pythonic* Beam Jobs on Flink (*or other languages)
  36. What is the (Flink) Runner/Flink doing in all this? ● Analyze/transform the Pipeline (Pipeline API) ● Create a Flink Job (DataSet/DataStream API) ● Ship the user code/Docker container description ● In an operator: Open gRPC services for control/data/logging/state plane ● Execute arbitrary user code using the Fn API. Easy, because Flink state/timers map well to Beam concepts!
  37. Advantages/Disadvantages. Advantages: ● Complete isolation of user code ● Complete configurability of execution environment (with Docker) ● We can support code written in arbitrary languages ● We can mix user code written in different languages. Disadvantages: ● Slower (RPC overhead) ● Using Docker requires Docker
  38. The Future
  39. Future work ● Finish what I just talked about ● Finalize the different APIs (not Flink-specific) ● Mixing and matching connectors written in different languages ● Wait for new SDKs in other languages, they will just work
  40. Learn More! Apache Beam / Apache Flink / Beam Fn API design documents. Join the mailing lists! Follow @ApacheFlink / @ApacheBeam on Twitter
  41. Thank you!
  42. Backup Slides
  43. Processing Time vs. Event Time