Stream Processing Advanced
with Flink and Pulsar
Till Rohrmann, Engineering Lead at Ververica, @stsffap
Addison Higham, Chief Architect at StreamNative, @addisonjh
Why is Stream Processing Important?
Spectrum of data streaming applications
(offline → real time)
● Data warehousing, OLAP / BI / reporting
● Machine learning & model training
● Continuous ETL
● Unified offline / real-time analytics
● Continuous monitoring (position, risk, …)
● Real-time behaviour modeling (pricing, recommenders, …)
● Real-time alerts (fraud, security, …)
● Real-time ML model training/evaluation
● Distributed OLTP applications
Choice of data architecture defines which use cases you can cover!
Modern Streaming Data Architecture
[Architecture diagram: Event Producers feed Stream Storage (backed by Long Term Storage); Stream Processing jobs consume it and produce Results / Views and Triggered Applications]
Reference Streaming Data Architecture: Flink + Pulsar
[Architecture diagram: the same pipeline with Apache Pulsar providing Stream Storage and Long Term Storage, and Apache Flink (including Stateful Functions) processing events from Event Producers into Results / Views]
Apache Flink: Analytics and Applications on Streaming Data
Flink Runtime: Stateful Computations over Data Streams
● Stateful Stream Processing (Streams, State, Time)
● Streaming Analytics (SQL and Tables)
● Event-driven Applications (Stateful Functions)
Apache Flink in Numbers
● Contributors: 887
● Most active Apache mailing list
● #2 most active Apache source repository by visits and commits
● Github stars: 16.4k
● Commits: >27k
● Releases: 1.13 latest major release
● Maven downloads per month: ~ 170k
● LOC: > 1.8 million
● Latency: < 1s
● Throughput: 4 billion events/s, 7 TB/s
● Input size: 100 TB for batch job
Some Apache Flink Users
One of The Most Advanced Stream Processors
● First class support for state
○ Asynchronous barrier checkpointing algorithm to create globally consistent checkpoints (a minimal sketch follows this list)
● Event-time support
○ Correctness under delayed events
● End-to-end exactly once processing guarantees
○ Correctness under faults
● Resource elastic
○ A Flink application’s resources can be adjusted to its actual needs
● Unified batch-streaming APIs
○ A single query to process historical as well as live data
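To make the checkpointing bullet concrete, here is a minimal DataStream sketch that turns it on; the 10-second interval and the toy pipeline are illustrative only:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Draw a globally consistent checkpoint every 10 seconds via Flink's
        // asynchronous barrier snapshotting; EXACTLY_ONCE is the default mode.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Toy pipeline so the sketch is runnable end to end.
        env.fromElements(1, 2, 3).print();
        env.execute("checkpointing-sketch");
    }
}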
Everything Is a Stream
● Stream: a sequence of data that is made available over time
● All computation processes chunks of data over time, producing results over time → stream processing
○ E.g. reading data from disk is done in a streaming fashion
● Events can take various forms
● The decisive difference: is my stream bounded or not?
SQL / Table API: Unified Batch & Stream Processing
SQL query → batch query execution

SELECT
  room,
  TUMBLE_END(rowtime, INTERVAL '1' HOUR),
  AVG(temperature)
FROM sensors
GROUP BY
  TUMBLE(rowtime, INTERVAL '1' HOUR),
  room
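As a rough sketch of what batch query execution looks like in code, the query can be submitted to a TableEnvironment running in batch mode. This assumes a bounded sensors table has already been registered (connector options are omitted and not part of the slide):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class BatchSensorQuery {
    public static void main(String[] args) {
        // Batch execution mode: the bounded `sensors` table is read once and the
        // hourly averages per room are emitted as a single final result.
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
                .inBatchMode()
                .build();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        // Assumes `sensors` has been registered beforehand (connector details omitted).
        tEnv.executeSql(
                "SELECT room, TUMBLE_END(rowtime, INTERVAL '1' HOUR), AVG(temperature) " +
                "FROM sensors " +
                "GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), room")
            .print();
    }
}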
Stream-Table Duality
Table snapshot (earlier):
  User   lastLogin
  Alice  2021-01-01
  Bob    2021-01-02

Table snapshot (later):
  User   lastLogin
  Alice  2021-01-14
  Bob    2021-02-01
  Eve    2021-01-21

Changelog stream:
  Alice, 2021-01-01
  Bob, 2021-01-02
  Alice, 2021-01-14
  Eve, 2021-01-21
  Bob, 2021-02-01
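A small sketch of this duality in the Table API, using the hypothetical login changelog above; the fromChangelogStream/upsert pattern follows the Flink 1.13 documentation:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Schema;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.connector.ChangelogMode;
import org.apache.flink.types.Row;
import org.apache.flink.types.RowKind;

public class StreamTableDuality {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // The changelog stream from the slide: inserts plus later per-user updates.
        DataStream<Row> changelog = env.fromElements(
                Row.ofKind(RowKind.INSERT,       "Alice", "2021-01-01"),
                Row.ofKind(RowKind.INSERT,       "Bob",   "2021-01-02"),
                Row.ofKind(RowKind.UPDATE_AFTER, "Alice", "2021-01-14"),
                Row.ofKind(RowKind.INSERT,       "Eve",   "2021-01-21"),
                Row.ofKind(RowKind.UPDATE_AFTER, "Bob",   "2021-02-01"))
            .returns(Types.ROW(Types.STRING, Types.STRING));

        // Interpreting the stream as an upserting table keyed on the user column
        // yields the later table snapshot once all events have been applied.
        Table users = tEnv.fromChangelogStream(
                changelog,
                Schema.newBuilder().primaryKey("f0").build(),
                ChangelogMode.upsert());

        // And the table can be turned back into a changelog stream.
        tEnv.toChangelogStream(users).print();
        env.execute();
    }
}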
SQL / Table API: Running The Same Query On Streams
SQL query → incremental query execution (interpret the stream as a table)

SELECT
  room,
  TUMBLE_END(rowtime, INTERVAL '1' HOUR),
  AVG(temperature)
FROM sensors
GROUP BY
  TUMBLE(rowtime, INTERVAL '1' HOUR),
  room
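Only the environment mode changes for incremental execution; a sketch under the same assumptions as the batch example above:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamingSensorQuery {
    public static void main(String[] args) {
        // Streaming (incremental) execution: the identical query now runs continuously,
        // emitting one result row per room whenever an hourly window closes.
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
                .inStreamingMode()
                .build();
        TableEnvironment tEnv = TableEnvironment.create(settings);

        // `sensors` is assumed to be registered on an unbounded source (e.g. a Pulsar topic).
        tEnv.executeSql(
                "SELECT room, TUMBLE_END(rowtime, INTERVAL '1' HOUR), AVG(temperature) " +
                "FROM sensors " +
                "GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), room")
            .print();
    }
}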
Data Has to Come From Somewhere
● Flink is a powerful compute engine for unified batch and stream processing that is pushing the boundaries of data processing
● For Flink to really shine, it needs a storage system that supports efficient stream and batch ingestion

Pulsar offers exactly that!

● Apache Pulsar is evolving to meet the needs of a system like Flink, offering both a low-latency, high-bandwidth streaming API and a flexible architecture that supports batch access
Why is Pulsar a Great Fit For Streaming And Batch?
Streaming
● Like other streaming systems, Pulsar can be used as a distributed scale-out log, providing consumer-controlled offsets and high throughput through parallel topic partitions

Batch
● In addition, Pulsar’s segment-based, multi-layer architecture lets applications access historical data more directly via the underlying storage (see the reader sketch after the diagram)
[Diagram: producers and consumers connect to brokers (Broker 1-3) serving the partitions of Topic1; each partition is stored as a sequence of segments replicated across bookies (Bookie 1-3), with older segments offloaded to long-term storage]
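As a sketch of the consumer-controlled access described above, a Pulsar Reader can position itself at the earliest retained message and replay history, including offloaded segments; the service URL and topic name below are placeholders:

import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;
import org.apache.pulsar.client.api.Schema;

public class HistoricalRead {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder service URL
                .build();

        // A Reader gives the application explicit control over its position in the topic;
        // starting at MessageId.earliest replays everything still retained, including
        // segments that have been offloaded to long-term storage.
        Reader<String> reader = client.newReader(Schema.STRING)
                .topic("persistent://public/default/sensors")  // placeholder topic
                .startMessageId(MessageId.earliest)
                .create();

        while (reader.hasMessageAvailable()) {
            System.out.println(reader.readNext().getValue());
        }

        reader.close();
        client.close();
    }
}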
Flink + Pulsar
Adoption at Scale
While Flink and Pulsar are both widely used independently, together they offer a
strong technology for unified batch and stream storage and compute
Many large organizations are adopting Flink + Pulsar for complex workloads that
can take advantage of the strengths of both systems
State of The Pulsar-Flink Connector Today
● The Pulsar-Flink connector supports Flink 1.11 and 1.12, with 1.13 support coming soon (a minimal usage sketch follows the link below)
● Exactly-once source and sink via producer deduplication
● Full support for Flink schemas with integration into the Pulsar Schema Registry
● Full Flink SQL support for batch and streaming modes
○ Also supports an “upsert” mode that can reason about inserts, updates, and deletes
● Support for KeyShared subscriptions for higher source parallelism than the number of topic partitions
● https://github.com/streamnative/pulsar-flink
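A rough usage sketch of the connector’s DataStream source. The constructor arguments, import paths, and the PulsarDeserializationSchema.valueOnly helper are assumptions based on the connector’s README at the time, so verify them against the release you actually use:

// NOTE: package names and the constructor signature below follow the connector
// README and may differ between connector versions.
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.pulsar.FlinkPulsarSource;
import org.apache.flink.streaming.util.serialization.PulsarDeserializationSchema;

public class PulsarFlinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("topic", "persistent://public/default/sensors"); // placeholder topic

        // Service and admin URLs are placeholders for a local broker.
        FlinkPulsarSource<String> source = new FlinkPulsarSource<>(
                "pulsar://localhost:6650",
                "http://localhost:8080",
                PulsarDeserializationSchema.valueOnly(new SimpleStringSchema()),
                props);

        env.addSource(source)
           .print();

        env.execute("pulsar-flink-connector-sketch");
    }
}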
Demo
The Road Ahead
Many in-progress features are being developed
● Pulsar-Flink connector being upstreamed into Flink
○ Targeted for Flink 1.14
● Pulsar Transaction Support
○ Pulsar transactions allow for stronger exactly-once processing guarantees
● Native Pulsar Watermarking
○ Pulsar is able to broker watermarks between producers and consumers
● Parallel Batch Source
○ Query multiple segments in parallel for higher throughput
● Unified Batch + Streaming Source
○ Use Pulsar’s batch mode for catch-up, then switch over to streaming ingestion
Flink Support For Pulsar Transactions
Pulsar transactions are GA in Pulsar 2.8.0
Support in Flink is nearing completion, which will allow for end-to-end exactly-once processing guarantees!
Learn more about transactions at the talk Exactly-Once Made Easy: Transactional Messaging in
Apache Pulsar at 11:30 AM
[Diagram: several Flink jobs (Job 1-3) chained through Pulsar topics, coordinated by a Pulsar transaction (tx)]
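To make the guarantee concrete, here is a minimal consume-transform-produce sketch against the raw Pulsar 2.8 transaction API (not the in-progress Flink integration). The URLs, topic names, and subscription name are placeholders, and the broker must run with the transaction coordinator enabled:

import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.transaction.Transaction;

public class TxnSketch {
    public static void main(String[] args) throws Exception {
        // Requires transactionCoordinatorEnabled=true on the broker (Pulsar >= 2.8.0).
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")         // placeholder URL
                .enableTransaction(true)
                .build();

        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("persistent://public/default/input")    // placeholder topics
                .subscriptionName("txn-demo")
                .subscribe();

        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/output")
                .sendTimeout(0, TimeUnit.SECONDS)               // required for transactional sends
                .create();

        Transaction txn = client.newTransaction()
                .withTransactionTimeout(5, TimeUnit.MINUTES)
                .build()
                .get();

        // Consume, transform and produce atomically: the output message and the
        // acknowledgement of the input only become visible when the txn commits.
        Message<String> msg = consumer.receive();
        producer.newMessage(txn).value(msg.getValue().toUpperCase()).send();
        consumer.acknowledgeAsync(msg.getMessageId(), txn);
        txn.commit().get();

        producer.close();
        consumer.close();
        client.close();
    }
}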
The Importance of Watermarks
High-quality watermarks are crucial for correct and stable stream processing jobs.

For results to be correct, we need to take “Event Time” into account rather than relying solely on “Processing Time”. This is especially important when replaying or processing older streaming data.

“Watermarks” advance the event-time clock; if the clock does not “tick” accurately, the results will not be correct.
[Plot: processing time vs. event time, showing the ideal watermark, the actual watermark, and the skew between them]
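Today this event-time clock is usually driven on the consumer side, for example with a bounded-out-of-orderness watermark strategy in Flink; the record type and the five-second bound below are illustrative:

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical (room, epoch-millis timestamp) readings.
        DataStream<Tuple2<String, Long>> readings = env.fromElements(
                Tuple2.of("roomA", 1_620_000_000_000L),
                Tuple2.of("roomB", 1_620_000_004_000L));

        // The watermark generator asserts "no event older than (max seen - 5s) will arrive";
        // if that assertion is wrong, late data is dropped and results become incorrect.
        DataStream<Tuple2<String, Long>> withEventTime = readings.assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((event, previousTimestamp) -> event.f1));

        withEventTime.print();
        env.execute("event-time-sketch");
    }
}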
Native Pulsar Watermarking
Pulsar is adding support for watermarks
● Producers get a new API for injecting watermarks into the stream, with multiple producers potentially producing watermarks for the same topic
● Watermarks are broadcast across all
partitions, which allows the broker to
dispatch more accurate watermarks to
consumers, in both realtime and
historical scenarios
● The API is designed to be simple to
integrate with Flink, removing the
complexity of accurate watermark
generation from the developer
// Pulsar producer: attach event time to messages (proposed watermark API)
for (int i = 0; i < NUM_MESSAGES; i++) {
    producer.newMessage()
        .value("hello with event time! " + i)
        .eventTime(System.currentTimeMillis())
        .sendAsync();
}

producer.newWatermark()
    .eventTime(System.currentTimeMillis())
    .sendAsync();

// Flink: consume Pulsar-brokered watermarks (proposed API)
FlinkPulsarSource<String> pulsarSource = new FlinkPulsarSource<>(topic, ...);
pulsarSource.assignTimestampsAndWatermarks(
    PulsarWatermarkStrategy.forPulsarWatermarks());
Shared Event-Time Domain
End-to-End Watermarks
[Diagram: a producer and three chained Flink jobs (Job 1-3) sharing one event-time domain, with watermarks (W|4, W|8, W|12, W|17, W|24) propagating end-to-end through Pulsar topics]
Pulsar + Flink and StreamNative + Ververica
Pulsar + Flink community collaboration
● Contributing Pulsar-Flink connector to Flink repository
● Evolve connector to support more advanced Pulsar and Flink features
● Build the best-in-class open source stream processing platform
StreamNative + Ververica Cloud partnership
● Help customers unlock the full potential of stream processing
Thanks a lot for your attention!
Questions?
Editor's Notes
  1. Faster results → faster business decisions → competitive advantage → more money
  2. Stream processing as a generalization of batch processing: processing of historical data, processing of live data, and processing of historical and live data together
  3. Brings together the worlds of data analytics and event-driven applications, enabling powerful applications that benefit from real-time analytics (e.g. fraud detection, match-making, etc.)
  4. Example users and use cases: real-time search and recommendation models (e.g. Alibaba), real-time session behavior profiles of users (e.g. Netflix), real-time trade settlement dashboards (e.g. UBS), real-time revenue accounting (various AdTechs), machine-learning-based anomaly/fraud detection (e.g. ING, Microsoft), real-time data refinement and data pipelines (many), real-time analytics platforms (e.g. Alibaba, Uber, Lyft, Yelp!, Tencent), materializing views (dashboards, data marts), ETL (batch and continuous), and machine learning training (Alibaba, new ML library). Scale figures: sub-second latency, peak throughput of 4B events/s and 7 TB/s, largest batch job input size of 100 TB, Flink cluster size of 30K.
  5. If everything is a stream, can it be handled with a single API? The Table API, SQL, and DataStream can handle bounded (batch) and unbounded streams: write the program once and run it on live and historic data, lowering development and maintenance costs. Having a single engine for batch and stream processing means batch processing is simply what you do when catching up with the stream: one engine runs batch and streaming workloads, one scheduler schedules bounded and unbounded tasks, and the bounded-stream property can be exploited for runtime optimizations (more batching of results, out-of-order processing, stage-wise execution).
  6. In this section, I will give a brief overview of watermarks and discuss correctness and skew a bit
  7. Watermarks are assertions about the progression of event time, according to the data producer. Pulsar is brokering the assertions from producer to consumer. The quality of the watermarks (and hence the job output) is determined by the producer; system internals (partitioning, I/O performance, etc) are effectively factored out.