Pulsar Summit
San Francisco
Hotel Nikko
August 18 2022
Keynote
Beam + Pulsar:
Powerful Stream
Processing at Scale
Byron Ellis
Senior Software Engineering Manager • Google Cloud
Lead of the Portable Languages and Tools team, which, among other things, works on Beam within Google Cloud.
A long-time user of large-scale data processing tools at companies both large and small and across many different job functions, he is now working to make it easier for everyone to use these tools.
About Apache Beam
The Beam Model
Example Beam Pipeline Execution: Sum Per Key
Java: input.apply(Sum.integersPerKey())
Python: input | Sum.PerKey()
Go: stats.Sum(s, input)
SQL: SELECT key, SUM(value) FROM input GROUP BY key
Runners: Cloud Dataflow, Apache Spark, Apache Flink, Apache Samza, and experimental runners in development
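The four snippets on this slide all express the same computation: group records by key and add up the values for each key. As a rough illustration of that semantics only, here is a minimal sketch in plain Python. It does not use Beam, and the function name sum_per_key and the sample records are hypothetical, chosen for this example rather than taken from the talk or from any Beam SDK.

```python
from collections import defaultdict


def sum_per_key(pairs):
    """Return a dict mapping each key to the sum of its values.

    Illustrative stand-in for the "Sum Per Key" step on the slide;
    a Beam runner would perform this grouping in a distributed way.
    """
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical sample input: (key, value) records.
records = [("a", 1), ("b", 2), ("a", 3)]
print(sum_per_key(records))  # prints {'a': 4, 'b': 2}
```

In a Beam pipeline the same logic is written once in the SDK of your choice, and the selected runner decides how to execute it.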
What Makes Apache Beam Different?
Truly Unified: Apache Beam is a truly unified batch and streaming data processing platform, with no compromises. Batch is batch, and streaming is streaming.
SDKs: Beam has official Java, Python, and Go SDKs, with an independent Scala SDK available as well.
Included I/Os: I/Os are included in Beam, which makes new I/Os easy to discover: just upgrade to the latest release and see what you get. (This is why we’re here today!)
Portability: Run Beam pipelines on your existing Spark clusters, Google Cloud Dataflow, AWS KDA, Talend, self-hosted Flink, or the newest experimental runners. Use the DirectRunner for local testing and development.
Pulsar on Apache Beam
Beam and Pulsar = 🤝
Beam is unique in that I/O connectors are included in the project rather than maintained externally. As a result, the Beam community and those of us at Google with Beam responsibilities are always looking for better I/O support.
Seeing the traction of Pulsar in the market, Google decided to kickstart PulsarIO by engaging a third-party vendor to build a new Beam connector.
First PR from StreamNative on PulsarIO: https://github.com/apache/beam/pull/22026
Initial PR for PulsarIO: https://github.com/apache/beam/pull/15572
PR for PulsarIO merged into Apache Beam main: https://github.com/apache/beam/pull/16634
Design doc for PulsarIO Beam connector: https://docs.google.com/document/d/11U81IEeB0rly63Ly62CTIa45fuvm05TGbTcXV2wRPnM/edit#heading=h.hxrvoaq1om85
Timeline dates shown on the slide: September 2021, January 2022, March 2022, June 2022
PulsarIO is now part of Beam, and you can use it today!
Please give it a whirl: try PulsarIO on your own dev machine with the Beam DirectRunner, or run Beam pipelines on your existing Spark or Flink clusters, or on the Google Cloud Dataflow managed service.
What’s next for Beam + Pulsar
StreamNative will ensure that the Apache Beam PulsarIO will be a certified StreamNative Cloud Compute Engine Connector!
Sachin Agarwal
Thank you!
sachinag @ google.com
@sachinag
/in/sachinag
