SlideShare a Scribd company logo
1 of 43
Talk Python to Me
Stream Processing in Your Favourite Language with Beam on Flink
Apache Beam
Slides by Aljoscha Krettek, September 2017, Flink Forward 2017
Apache Flink
Based on work and slides by Frances Perry, Tyler Akidau, Kenneth
Knowles & Sourabh Bajaj
2
Agenda
1. What is Beam?
2. The Beam Portability APIs (Fn / Pipeline)
3. Executing Pythonic Beam Jobs on Flink
4. The Future
33
What is Beam?
The Evolution of Apache Beam
MapReduce
BigTable DremelColossus
FlumeMegastoreSpanner
PubSub
Millwheel
Apache
Beam
Google Cloud
Dataflow
5
Beam Model: Generations Beyond MapReduce
Improved abstractions let you focus
on your application logic
Batch and stream processing are
both first-class citizens -- no need to
choose.
Clearly separates event time from
processing time.
6
The Apache Beam Vision
1. End users: who want to write
pipelines in a language that’s familiar.
2. SDK writers: who want to make Beam
concepts available in new languages.
3. Runner writers: who have a
distributed processing environment
and want to support Beam pipelines
Beam Model: Fn Runners
Apache
Flink
Apache
Spark
Beam Model: Pipeline Construction
Other
LanguagesBeam Java
Beam
Python
Execution Execution
Cloud
Dataflow
Execution
7
The Beam Model
(Flink draws it more like this)
8
The Beam Model
Pipeline
PTransform
PCollection
(bounded or unbounded)
9
Beam Model: Asking the Right Questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
The Beam Model: What is Being Computed?
PCollection<KV<String, Integer>> scores = input
.apply(Sum.integersPerKey());
scores= (input
| Sum.integersPerKey())
The Beam Model: What is Being Computed?
The Beam Model: Where in Event Time?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
.apply(Sum.integersPerKey());
scores= (input
| beam.WindowInto(FixedWindows(2 * 60))
| Sum.integersPerKey())
The Beam Model: Where in Event Time?
The Beam Model: When in Processing Time?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))
.triggering(AtWatermark()))
.apply(Sum.integersPerKey());
scores = (input
| beam.WindowInto(FixedWindows(2 * 60)
.triggering(AtWatermark()))
| Sum.integersPerKey())
The Beam Model: When in Processing Time?
The Beam Model: How Do Refinements Relate?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))
.triggering(AtWatermark()
.withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
.withLateFirings(AtCount(1)))
.accumulatingFiredPanes())
.apply(Sum.integersPerKey());
scores = (input
| beam.WindowInto(FixedWindows(2 * 60)
.triggering(AtWatermark()
.withEarlyFirings(AtPeriod(1 * 60))
.withLateFirings(AtCount(1)))
.accumulatingFiredPanes())
| Sum.integersPerKey())
The Beam Model: How Do Refinements Relate?
18
Customizing What Where When How
3
Streaming
4
Streaming
+ Accumulation
1
Classic
Batch
For more information see https://cloud.google.com/dataflow/examples/gaming-example
2
Windowed
Batch
A Complete Example of Pythonic Beam Code
import apache_beam as beam
with beam.Pipeline() as p:
(p
| beam.io.ReadStringsFromPubSub("twitter_topic")
| beam.WindowInto(SlidingWindows(5*60, 1*60))
| beam.ParDo(ParseHashTagDoFn())
| beam.combiners.Count.PerElement()
| beam.ParDo(BigQueryOutputFormatDoFn())
| beam.io.WriteToBigQuery("trends_table"))
What is Apache Beam?
1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines
3. Runners for Existing Distributed Processing Backends
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Local (in-process) runner for testing
2121
Beam Portability APIs (Pipeline / Job / Fn)
22
What are we trying to solve?
● Executing user code written in an arbitrary language (Python) on a Runner
written in a different language (Java)
● Mixing user functions written in different languages (Connectors, Sources,
Sinks, …)
23
Terminology
Beam Model
Describes the API concepts and the
possible operations on
PCollections.
Pipeline
User-defined graph of
transformations on PCollections.
This is constructed using a Beam
SDK. The transformations can
contain UDFs.
Runner
Executes a Pipeline. For example:
FlinkRunner.
Beam SDK
Language specific
library/framework for creating
programs that use the Beam Model.
Allows defining Pipelines and UDFs
and provides APIs for executing
them.
User-defined function (UDF)
Code in Java, Python, … that
specifies how data is transformed.
For example DoFn or CombineFn.
24
Executing a Beam Pipeline - The Big Picture
SDK
User
Pipeline
Pipeline
API
Runner
Worker
Fn
API
Job
API
25
APIs for Different Pipeline Lifecycle Stages
Pipeline API
● Used by the SDK to construct
SDK-agnostic Pipeline
representation
● Used by the Runner to
translate a Pipeline to
runner-specific operations
Fn API
● Used by an SDK harness for
communication with a Runner
● User by the Runner to push
work into an SDK harness
Job API
● (API for interacting with a
running Pipeline)
26
Pipeline API (simplified)
● Definition of common primitive transformations
(Read, ParDo, Flatten, Window.into, GroupByKey)
● Definition of serialized Pipeline (protobuf)
https://s.apache.org/beam-runner-ap
i
Pipeline = {PCollection*, PTransform*, WindowingStrategy*,
Coder*}
PTransform = {Inputs*, Outputs*, FunctionSpec}
FunctionSpec = {URN, payload}
27
Job API
public interface JobApi {
State getState(); // RUNNING, DONE, CANCELED, FAILED ...
State cancel() throws IOException;
State waitUntilFinish(Duration duration);
State waitUntilFinish();
MetricResults metrics();
}
28
Fn API
● gRPC interface definitions for communication between an SDK harness
and a Runner
https://s.apache.org/beam-fn-api
● Control: Used to tell the SDK which UDFs to execute and when to execute
them.
● Data: Used to move data between the language specific SDK harness and
the runner.
● State: Used to support user state, side inputs, and group by key
reiteration.
● Logging: Used to aggregate logging information from the language
specific SDK harness.
29
Fn API (continued)
https://s.apache.org/beam-fn-api
30
Fn API - Bundle Processing
https://s.apache.org/beam-fn-api-processing-a-bundl
e
31
Fn API - Processing DoFns
https://s.apache.org/beam-fn-api-send-and-receive-data
Say we need to
execute this part
32
Fn API - Processing DoFns
https://s.apache.org/beam-fn-api-send-and-receive-data
Python DoFn
Python DoFn
33
Fn API - Processing DoFns (Pipeline manipulation)
https://s.apache.org/beam-fn-api-send-and-receive-data
Python DoFn
Python DoFn
gRPC Source
gRPC Sink
The Runner
inserts these
34
Fn API - Executing the user Fn using a SDK Harness
● We can execute as a separate process
● We can execute in a Docker container
Worker
Fn
API
https://s.apache.org/beam-fn-api-container-contract
● Repository of containers for different
SDKs
● We inject the user code into the
container when starting
● Container is user-configurable
3535
Executing Pythonic* Beam Jobs on Fink
*or other languages
36
What is the (Flink) Runner/Flink doing in all this?
● Analyze/transform the Pipeline (Pipeline API)
● Create a Flink Job (DataSet/DataStream API)
● Ship the user code/docker container description
● In an operator: Open gRPC services for control/data/logging/state
plane
● Execute arbitrary user code using the Fn API
Easy, because Flink state/timers map well to Beam concepts!
37
Advantages/Disadvantages
● Complete isolation of user
code
● Complete configurability of
execution environment (with
Docker)
● We can support code written in
arbitrary languages
● We can mix user code written
in different languages
● Slower (RPC overhead)
● Using Docker requires docker
3838
The Future
39
Future work
● Finish what I just talked about
● Finalize the different APIs (not Flink-specific)
● Mixing and matching connectors written in different languages
● Wait for new SDKs in other languages, they will just work
40
Learn More!
Apache Beam/Apache Flink
https://flink.apache.org / https://beam.apache.org
Beam Fn API design documents
https://s.apache.org/beam-runner-api
https://s.apache.org/beam-fn-api
https://s.apache.org/beam-fn-api-processing-a-bundle
https://s.apache.org/beam-fn-state-api-and-bundle-processing
https://s.apache.org/beam-fn-api-send-and-receive-data
https://s.apache.org/beam-fn-api-container-contract
Join the mailing lists!
user-subscribe@flink.apache.org / dev-subscribe@flink.apache.org
user-subscribe@beam.apache.org / dev-subscribe@beam.apache.org
Follow @ApacheFlink / @ApacheBeam on Twitter
41
Thank you!
4242
Backup Slides
43
Processing Time vs. Event Time

More Related Content

What's hot

Flink Forward Berlin 2017: Dominik Bruhn - Deploying Flink Jobs as Docker Con...
Flink Forward Berlin 2017: Dominik Bruhn - Deploying Flink Jobs as Docker Con...Flink Forward Berlin 2017: Dominik Bruhn - Deploying Flink Jobs as Docker Con...
Flink Forward Berlin 2017: Dominik Bruhn - Deploying Flink Jobs as Docker Con...Flink Forward
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuFlink Forward
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasFlink Forward
 
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
Flink Forward Berlin 2017: Patrick Lucas - Flink in ContainerlandFlink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
Flink Forward Berlin 2017: Patrick Lucas - Flink in ContainerlandFlink Forward
 
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...Flink Forward
 
Flink Forward San Francisco 2019: Scaling a real-time streaming warehouse wit...
Flink Forward San Francisco 2019: Scaling a real-time streaming warehouse wit...Flink Forward San Francisco 2019: Scaling a real-time streaming warehouse wit...
Flink Forward San Francisco 2019: Scaling a real-time streaming warehouse wit...Flink Forward
 
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...Flink Forward
 
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...
Flink Forward SF 2017:  Cliff Resnick & Seth Wiesman -   From Zero to Streami...Flink Forward SF 2017:  Cliff Resnick & Seth Wiesman -   From Zero to Streami...
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...Flink Forward
 
Flink Forward Berlin 2017: Roberto Bentivoglio, Saverio Veltri - NSDB (Natura...
Flink Forward Berlin 2017: Roberto Bentivoglio, Saverio Veltri - NSDB (Natura...Flink Forward Berlin 2017: Roberto Bentivoglio, Saverio Veltri - NSDB (Natura...
Flink Forward Berlin 2017: Roberto Bentivoglio, Saverio Veltri - NSDB (Natura...Flink Forward
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkFlink Forward
 
Flink Forward San Francisco 2019: Developing and operating real-time applicat...
Flink Forward San Francisco 2019: Developing and operating real-time applicat...Flink Forward San Francisco 2019: Developing and operating real-time applicat...
Flink Forward San Francisco 2019: Developing and operating real-time applicat...Flink Forward
 
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...Flink Forward
 
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...Flink Forward
 
Future of Apache Flink Deployments: Containers, Kubernetes and More - Flink F...
Future of Apache Flink Deployments: Containers, Kubernetes and More - Flink F...Future of Apache Flink Deployments: Containers, Kubernetes and More - Flink F...
Future of Apache Flink Deployments: Containers, Kubernetes and More - Flink F...Till Rohrmann
 
Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...
Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...
Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...Flink Forward
 
Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Thomas Weise
 
Virtual Flink Forward 2020: Build your next-generation stream platform based ...
Virtual Flink Forward 2020: Build your next-generation stream platform based ...Virtual Flink Forward 2020: Build your next-generation stream platform based ...
Virtual Flink Forward 2020: Build your next-generation stream platform based ...Flink Forward
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...Flink Forward
 
Flink Forward SF 2017: Jamie Grier - Apache Flink - The latest and greatest
Flink Forward SF 2017: Jamie Grier - Apache Flink - The latest and greatestFlink Forward SF 2017: Jamie Grier - Apache Flink - The latest and greatest
Flink Forward SF 2017: Jamie Grier - Apache Flink - The latest and greatestFlink Forward
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Kostas Tzoumas
 

What's hot (20)

Flink Forward Berlin 2017: Dominik Bruhn - Deploying Flink Jobs as Docker Con...
Flink Forward Berlin 2017: Dominik Bruhn - Deploying Flink Jobs as Docker Con...Flink Forward Berlin 2017: Dominik Bruhn - Deploying Flink Jobs as Docker Con...
Flink Forward Berlin 2017: Dominik Bruhn - Deploying Flink Jobs as Docker Con...
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
 
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
Flink Forward Berlin 2017: Patrick Lucas - Flink in ContainerlandFlink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland
 
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...
Time to-live: How to Perform Automatic State Cleanup in Apache Flink - Andrey...
 
Flink Forward San Francisco 2019: Scaling a real-time streaming warehouse wit...
Flink Forward San Francisco 2019: Scaling a real-time streaming warehouse wit...Flink Forward San Francisco 2019: Scaling a real-time streaming warehouse wit...
Flink Forward San Francisco 2019: Scaling a real-time streaming warehouse wit...
 
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
 
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...
Flink Forward SF 2017:  Cliff Resnick & Seth Wiesman -   From Zero to Streami...Flink Forward SF 2017:  Cliff Resnick & Seth Wiesman -   From Zero to Streami...
Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...
 
Flink Forward Berlin 2017: Roberto Bentivoglio, Saverio Veltri - NSDB (Natura...
Flink Forward Berlin 2017: Roberto Bentivoglio, Saverio Veltri - NSDB (Natura...Flink Forward Berlin 2017: Roberto Bentivoglio, Saverio Veltri - NSDB (Natura...
Flink Forward Berlin 2017: Roberto Bentivoglio, Saverio Veltri - NSDB (Natura...
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
 
Flink Forward San Francisco 2019: Developing and operating real-time applicat...
Flink Forward San Francisco 2019: Developing and operating real-time applicat...Flink Forward San Francisco 2019: Developing and operating real-time applicat...
Flink Forward San Francisco 2019: Developing and operating real-time applicat...
 
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
 
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
 
Future of Apache Flink Deployments: Containers, Kubernetes and More - Flink F...
Future of Apache Flink Deployments: Containers, Kubernetes and More - Flink F...Future of Apache Flink Deployments: Containers, Kubernetes and More - Flink F...
Future of Apache Flink Deployments: Containers, Kubernetes and More - Flink F...
 
Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...
Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...
Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...
 
Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019
 
Virtual Flink Forward 2020: Build your next-generation stream platform based ...
Virtual Flink Forward 2020: Build your next-generation stream platform based ...Virtual Flink Forward 2020: Build your next-generation stream platform based ...
Virtual Flink Forward 2020: Build your next-generation stream platform based ...
 
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...Flink Forward SF 2017: Timo Walther -  Table & SQL API – unified APIs for bat...
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...
 
Flink Forward SF 2017: Jamie Grier - Apache Flink - The latest and greatest
Flink Forward SF 2017: Jamie Grier - Apache Flink - The latest and greatestFlink Forward SF 2017: Jamie Grier - Apache Flink - The latest and greatest
Flink Forward SF 2017: Jamie Grier - Apache Flink - The latest and greatest
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
 

Similar to Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Processing in Your Favourite Language with Beam on Flink

Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...Aljoscha Krettek
 
Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...
Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...
Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...Flink Forward
 
Python Streaming Pipelines with Beam on Flink
Python Streaming Pipelines with Beam on FlinkPython Streaming Pipelines with Beam on Flink
Python Streaming Pipelines with Beam on FlinkAljoscha Krettek
 
Flink Forward Berlin 2018: Robert Bradshaw & Maximilian Michels - "Universal ...
Flink Forward Berlin 2018: Robert Bradshaw & Maximilian Michels - "Universal ...Flink Forward Berlin 2018: Robert Bradshaw & Maximilian Michels - "Universal ...
Flink Forward Berlin 2018: Robert Bradshaw & Maximilian Michels - "Universal ...Flink Forward
 
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Dan Halperin
 
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019Thomas Weise
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamDataWorks Summit
 
Stream processing for the masses with beam, python and flink
Stream processing for the masses with beam, python and flink Stream processing for the masses with beam, python and flink
Stream processing for the masses with beam, python and flink Enrico Canzonieri
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)Apache Apex
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonTimothy Spann
 
Stream-Native Processing with Pulsar Functions
Stream-Native Processing with Pulsar FunctionsStream-Native Processing with Pulsar Functions
Stream-Native Processing with Pulsar FunctionsStreamlio
 
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...apidays
 
Build Great Networked APIs with Swift, OpenAPI, and gRPC
Build Great Networked APIs with Swift, OpenAPI, and gRPCBuild Great Networked APIs with Swift, OpenAPI, and gRPC
Build Great Networked APIs with Swift, OpenAPI, and gRPCTim Burks
 
workshop_8_c__.pdf
workshop_8_c__.pdfworkshop_8_c__.pdf
workshop_8_c__.pdfAtulAvhad2
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamDataWorks Summit
 
OpenDaylight and YANG
OpenDaylight and YANGOpenDaylight and YANG
OpenDaylight and YANGCoreStack
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamDataWorks Summit/Hadoop Summit
 

Similar to Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Processing in Your Favourite Language with Beam on Flink (20)

Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
Talk Python To Me: Stream Processing in your favourite Language with Beam on ...
 
Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...
Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...
Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin...
 
Python Streaming Pipelines with Beam on Flink
Python Streaming Pipelines with Beam on FlinkPython Streaming Pipelines with Beam on Flink
Python Streaming Pipelines with Beam on Flink
 
Flink Forward Berlin 2018: Robert Bradshaw & Maximilian Michels - "Universal ...
Flink Forward Berlin 2018: Robert Bradshaw & Maximilian Michels - "Universal ...Flink Forward Berlin 2018: Robert Bradshaw & Maximilian Michels - "Universal ...
Flink Forward Berlin 2018: Robert Bradshaw & Maximilian Michels - "Universal ...
 
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
 
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
Python Streaming Pipelines on Flink - Beam Meetup at Lyft 2019
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
 
Stream processing for the masses with beam, python and flink
Stream processing for the masses with beam, python and flink Stream processing for the masses with beam, python and flink
Stream processing for the masses with beam, python and flink
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with Python
 
Stream-Native Processing with Pulsar Functions
Stream-Native Processing with Pulsar FunctionsStream-Native Processing with Pulsar Functions
Stream-Native Processing with Pulsar Functions
 
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
 
Build Great Networked APIs with Swift, OpenAPI, and gRPC
Build Great Networked APIs with Swift, OpenAPI, and gRPCBuild Great Networked APIs with Swift, OpenAPI, and gRPC
Build Great Networked APIs with Swift, OpenAPI, and gRPC
 
workshop_8_c__.pdf
workshop_8_c__.pdfworkshop_8_c__.pdf
workshop_8_c__.pdf
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
 
OpenDaylight and YANG
OpenDaylight and YANGOpenDaylight and YANG
OpenDaylight and YANG
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 

More from Flink Forward

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!Flink Forward
 

More from Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Welcome to the Flink Community!
Welcome to the Flink Community!Welcome to the Flink Community!
Welcome to the Flink Community!
 

Recently uploaded

MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introductionsanjaymuralee1
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)Data & Analytics Magazin
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionajayrajaganeshkayala
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityAggregage
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxDwiAyuSitiHartinah
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?sonikadigital1
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 

Recently uploaded (17)

MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 

Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Processing in Your Favourite Language with Beam on Flink

  • 1. Talk Python to Me Stream Processing in Your Favourite Language with Beam on Flink Apache Beam Slides by Aljoscha Krettek, September 2017, Flink Forward 2017 Apache Flink Based on work and slides by Frances Perry, Tyler Akidau, Kenneth Knowles & Sourabh Bajaj
  • 2. 2 Agenda 1. What is Beam? 2. The Beam Portability APIs (Fn / Pipeline) 3. Executing Pythonic Beam Jobs on Flink 4. The Future
  • 4. The Evolution of Apache Beam MapReduce BigTable DremelColossus FlumeMegastoreSpanner PubSub Millwheel Apache Beam Google Cloud Dataflow
  • 5. 5 Beam Model: Generations Beyond MapReduce Improved abstractions let you focus on your application logic Batch and stream processing are both first-class citizens -- no need to choose. Clearly separates event time from processing time.
  • 6. 6 The Apache Beam Vision 1. End users: who want to write pipelines in a language that’s familiar. 2. SDK writers: who want to make Beam concepts available in new languages. 3. Runner writers: who have a distributed processing environment and want to support Beam pipelines Beam Model: Fn Runners Apache Flink Apache Spark Beam Model: Pipeline Construction Other LanguagesBeam Java Beam Python Execution Execution Cloud Dataflow Execution
  • 7. 7 The Beam Model (Flink draws it more like this)
  • 9. 9 Beam Model: Asking the Right Questions What results are calculated? Where in event time are results calculated? When in processing time are results materialized? How do refinements of results relate?
  • 10. The Beam Model: What is Being Computed? PCollection<KV<String, Integer>> scores = input .apply(Sum.integersPerKey()); scores= (input | Sum.integersPerKey())
  • 11. The Beam Model: What is Being Computed?
  • 12. The Beam Model: Where in Event Time? PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))) .apply(Sum.integersPerKey()); scores= (input | beam.WindowInto(FixedWindows(2 * 60)) | Sum.integersPerKey())
  • 13. The Beam Model: Where in Event Time?
  • 14. The Beam Model: When in Processing Time? PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)) .triggering(AtWatermark())) .apply(Sum.integersPerKey()); scores = (input | beam.WindowInto(FixedWindows(2 * 60) .triggering(AtWatermark())) | Sum.integersPerKey())
  • 15. The Beam Model: When in Processing Time?
  • 16. The Beam Model: How Do Refinements Relate? PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Duration.standardMinutes(1))) .withLateFirings(AtCount(1))) .accumulatingFiredPanes()) .apply(Sum.integersPerKey()); scores = (input | beam.WindowInto(FixedWindows(2 * 60) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(1 * 60)) .withLateFirings(AtCount(1))) .accumulatingFiredPanes()) | Sum.integersPerKey())
  • 17. The Beam Model: How Do Refinements Relate?
  • 18. 18 Customizing What Where When How 3 Streaming 4 Streaming + Accumulation 1 Classic Batch For more information see https://cloud.google.com/dataflow/examples/gaming-example 2 Windowed Batch
  • 19. A Complete Example of Pythonic Beam Code import apache_beam as beam with beam.Pipeline() as p: (p | beam.io.ReadStringsFromPubSub("twitter_topic") | beam.WindowInto(SlidingWindows(5*60, 1*60)) | beam.ParDo(ParseHashTagDoFn()) | beam.combiners.Count.PerElement() | beam.ParDo(BigQueryOutputFormatDoFn()) | beam.io.WriteToBigQuery("trends_table"))
  • 20. What is Apache Beam? 1. The Beam Model: What / Where / When / How 2. SDKs for writing Beam pipelines 3. Runners for Existing Distributed Processing Backends ○ Apache Apex ○ Apache Flink ○ Apache Spark ○ Google Cloud Dataflow ○ Local (in-process) runner for testing
  • 21. 2121 Beam Portability APIs (Pipeline / Job / Fn)
  • 22. 22 What are we trying to solve? ● Executing user code written in an arbitrary language (Python) on a Runner written in a different language (Java) ● Mixing user functions written in different languages (Connectors, Sources, Sinks, …)
  • 23. 23 Terminology Beam Model Describes the API concepts and the possible operations on PCollections. Pipeline User-defined graph of transformations on PCollections. This is constructed using a Beam SDK. The transformations can contain UDFs. Runner Executes a Pipeline. For example: FlinkRunner. Beam SDK Language specific library/framework for creating programs that use the Beam Model. Allows defining Pipelines and UDFs and provides APIs for executing them. User-defined function (UDF) Code in Java, Python, … that specifies how data is transformed. For example DoFn or CombineFn.
  • 24. 24 Executing a Beam Pipeline - The Big Picture SDK User Pipeline Pipeline API Runner Worker Fn API Job API
  • 25. 25 APIs for Different Pipeline Lifecycle Stages Pipeline API ● Used by the SDK to construct SDK-agnostic Pipeline representation ● Used by the Runner to translate a Pipeline to runner-specific operations Fn API ● Used by an SDK harness for communication with a Runner ● User by the Runner to push work into an SDK harness Job API ● (API for interacting with a running Pipeline)
  • 26. 26 Pipeline API (simplified) ● Definition of common primitive transformations (Read, ParDo, Flatten, Window.into, GroupByKey) ● Definition of serialized Pipeline (protobuf) https://s.apache.org/beam-runner-ap i Pipeline = {PCollection*, PTransform*, WindowingStrategy*, Coder*} PTransform = {Inputs*, Outputs*, FunctionSpec} FunctionSpec = {URN, payload}
  • 27. 27 Job API public interface JobApi { State getState(); // RUNNING, DONE, CANCELED, FAILED ... State cancel() throws IOException; State waitUntilFinish(Duration duration); State waitUntilFinish(); MetricResults metrics(); }
  • 28. 28 Fn API ● gRPC interface definitions for communication between an SDK harness and a Runner https://s.apache.org/beam-fn-api ● Control: Used to tell the SDK which UDFs to execute and when to execute them. ● Data: Used to move data between the language specific SDK harness and the runner. ● State: Used to support user state, side inputs, and group by key reiteration. ● Logging: Used to aggregate logging information from the language specific SDK harness.
  • 30. 30 Fn API - Bundle Processing https://s.apache.org/beam-fn-api-processing-a-bundl e
  • 31. 31 Fn API - Processing DoFns https://s.apache.org/beam-fn-api-send-and-receive-data Say we need to execute this part
  • 32. 32 Fn API - Processing DoFns https://s.apache.org/beam-fn-api-send-and-receive-data Python DoFn Python DoFn
  • 33. 33 Fn API - Processing DoFns (Pipeline manipulation) https://s.apache.org/beam-fn-api-send-and-receive-data Python DoFn Python DoFn gRPC Source gRPC Sink The Runner inserts these
  • 34. 34 Fn API - Executing the user Fn using a SDK Harness ● We can execute as a separate process ● We can execute in a Docker container Worker Fn API https://s.apache.org/beam-fn-api-container-contract ● Repository of containers for different SDKs ● We inject the user code into the container when starting ● Container is user-configurable
  • 35. 3535 Executing Pythonic* Beam Jobs on Fink *or other languages
  • 36. 36 What is the (Flink) Runner/Flink doing in all this? ● Analyze/transform the Pipeline (Pipeline API) ● Create a Flink Job (DataSet/DataStream API) ● Ship the user code/docker container description ● In an operator: Open gRPC services for control/data/logging/state plane ● Execute arbitrary user code using the Fn API Easy, because Flink state/timers map well to Beam concepts!
  • 37. 37 Advantages/Disadvantages ● Complete isolation of user code ● Complete configurability of execution environment (with Docker) ● We can support code written in arbitrary languages ● We can mix user code written in different languages ● Slower (RPC overhead) ● Using Docker requires docker
  • 39. 39 Future work ● Finish what I just talked about ● Finalize the different APIs (not Flink-specific) ● Mixing and matching connectors written in different languages ● Wait for new SDKs in other languages, they will just work
  • 40. 40 Learn More! Apache Beam/Apache Flink https://flink.apache.org / https://beam.apache.org Beam Fn API design documents https://s.apache.org/beam-runner-api https://s.apache.org/beam-fn-api https://s.apache.org/beam-fn-api-processing-a-bundle https://s.apache.org/beam-fn-state-api-and-bundle-processing https://s.apache.org/beam-fn-api-send-and-receive-data https://s.apache.org/beam-fn-api-container-contract Join the mailing lists! user-subscribe@flink.apache.org / dev-subscribe@flink.apache.org user-subscribe@beam.apache.org / dev-subscribe@beam.apache.org Follow @ApacheFlink / @ApacheBeam on Twitter