Despite what the Ghostbusters said, we’re going to go ahead and cross (or rather, join) the streams. This session covers getting started with streaming data pipelines, pairing Pulsar’s messaging system with one of the most flexible streaming frameworks available, Apache Flink. Specifically, we’ll demonstrate Flink SQL, which provides various abstractions and keeps your pipeline language-agnostic. So, if you want to leverage the power of a high-speed, highly customizable stream processing engine without the usual overhead and learning curves of the technologies involved (and their interconnected relationships), this talk is for you. Watch the step-by-step demo to build a unified batch and streaming pipeline from scratch with Pulsar, via the Flink SQL client. This means you don’t need to be familiar with Flink (or even a specific programming language). The examples provided are built for highly complex systems, but the talk itself is accessible to any experience level.
9. Stream Processing > The Challenges
@CAITO_200_OK
● You can’t pause to fix it
● Lots of data, FAST
● Ingesting multiple formats
● Failure recovery
● Needs to scale
12. Flink > Basics
Building Blocks (events, state, (event) time)
DataStream API (streams, windows)
Table API (dynamic tables)
Flink SQL
PyFlink
(Diagram: the APIs are layered from stateful stream processing at the bottom to streaming analytics & ML at the top, trading expressiveness for ease of use as you move up the stack.)
13. Flink > Summary
Flexible APIs
● Ease of use/Expressiveness
● Wide Range of Use Cases
High Performance
● Local State Access
● High Throughput/Low Latency
Stateful Processing
● State = First-class Citizen
● Event-time Support
Fault Tolerance
● Distributed State Snapshots
● Exactly-once Guarantees
14. Flink SQL
● Stream processing = processing data in real time
● Stream processing is complex
● Flink is a highly performant streaming engine
● Flink solves many of streaming’s hard problems
● Flink is complex
● Flink SQL gives you access to Flink’s benefits
● …and abstracts away the complexity
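One concrete way Flink SQL hides that complexity: the same query text can run over bounded (batch) or unbounded (streaming) input. In recent Flink versions, switching is a single setting in the SQL client. A minimal sketch, reusing the `clicks` example from later slides:

```sql
-- Run the query over a bounded snapshot of the input (batch mode)...
SET 'execution.runtime-mode' = 'batch';

-- ...or over the live, unbounded stream (streaming mode, the default).
SET 'execution.runtime-mode' = 'streaming';

-- The query itself is identical in both modes.
SELECT user_id, COUNT(url) AS cnt
FROM clicks
GROUP BY user_id;
```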
21. Flink SQL Demo
● Making the complex simple
● You could start a data pipeline anywhere!
● Language agnostic
From: Free Guy movie
22. Flink SQL Demo > Regular SQL
SELECT user_id,
       COUNT(url) AS cnt
FROM clicks
GROUP BY user_id;

Input (clicks):
user | cTime    | url
Mary | 12:00:00 | https://…
Bob  | 12:00:00 | https://…
Mary | 12:00:02 | https://…
Liz  | 12:00:03 | https://…

Result:
user | cnt
Mary | 2
Bob  | 1

A regular SQL query takes a snapshot when it starts, produces a final result, and terminates. A row that was added after the query started (here, Liz’s click) is not considered.
Image: Marta Paes @morsapaes
23. Flink SQL Demo > Flink SQL
SELECT user_id,
       COUNT(url) AS cnt
FROM clicks
GROUP BY user_id;

Input (clicks):
user | cTime    | url
Mary | 12:00:00 | https://…
Bob  | 12:00:00 | https://…
Mary | 12:00:02 | https://…
Liz  | 12:00:03 | https://…

Continuously updated result:
user | cnt
Mary | 1 → 2
Bob  | 1
Liz  | 1

Flink SQL ingests all changes as they happen and continuously updates the result. The result is identical to the one-time query (at this point).
Image: Marta Paes @morsapaes
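You can watch these continuous updates directly in the Flink SQL client. A sketch, assuming a recent Flink version where the `sql-client.execution.result-mode` option is available:

```sql
-- Show the result as a changelog: inserts (+I), retractions (-U), updates (+U).
SET 'sql-client.execution.result-mode' = 'changelog';

SELECT user_id, COUNT(url) AS cnt
FROM clicks
GROUP BY user_id;

-- When Mary's second click arrives, the client emits something like:
--   -U (Mary, 1)
--   +U (Mary, 2)
```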
40. Demo > Twitter Firehose
Demo: Marta Paes @morsapaes
41. Demo > Twitter Firehose
Demo: Marta Paes @morsapaes
42. Demo > Twitter Firehose
CREATE CATALOG pulsar WITH (
'type' = 'pulsar',
'service-url' = 'pulsar://pulsar:6650',
'admin-url' = 'http://pulsar:8080',
'format' = 'json'
);
Catalog DDL
Demo: Marta Paes @morsapaes
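Once the catalog is registered, existing Pulsar topics are exposed as tables without any per-topic DDL. A sketch of exploring it from the SQL client (the `tweets` topic/table name follows the demo; exact table naming depends on the connector version):

```sql
USE CATALOG pulsar;

-- Each Pulsar topic in the namespace appears as a table.
SHOW TABLES;

-- Query a topic directly; the JSON payload is mapped to columns.
SELECT * FROM tweets;
```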
43. Demo > Twitter Firehose
Not cool. 👹
Demo: Marta Paes @morsapaes
44. Demo > Get Relevant Timestamps
CREATE TABLE pulsar_tweets (
  publishTime TIMESTAMP(3) METADATA,
  WATERMARK FOR publishTime AS publishTime - INTERVAL '5' SECOND
) WITH (
  'connector' = 'pulsar',
  'topic' = 'persistent://public/default/tweets',
  'value.format' = 'json',
  'service-url' = 'pulsar://pulsar:6650',
  'admin-url' = 'http://pulsar:8080',
  'scan.startup.mode' = 'earliest-offset'
)
LIKE tweets;
Derive schema from the original topic
Define the source connector (Pulsar)
Read and use Pulsar message metadata
Demo: Marta Paes @morsapaes
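To verify the metadata column and watermark were picked up, you can inspect the new table before wiring it into a pipeline. A sketch (`id` is assumed to be one of the columns inherited from `tweets`):

```sql
-- Shows the inherited columns plus publishTime (METADATA) and the watermark spec.
DESCRIBE pulsar_tweets;

-- Spot-check that event timestamps are populated from Pulsar's publish time.
SELECT id, publishTime
FROM pulsar_tweets;
```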
45. Demo > Windowed Aggregation
CREATE TABLE pulsar_tweets_agg (
  tmstmp TIMESTAMP(3),
  tweet_cnt BIGINT
) WITH (
  'connector' = 'pulsar',
  'topic' = 'persistent://public/default/tweets_agg',
  'value.format' = 'json',
  'service-url' = 'pulsar://pulsar:6650',
  'admin-url' = 'http://pulsar:8080'
);
Sink Table DDL

INSERT INTO pulsar_tweets_agg
SELECT TUMBLE_START(publishTime, INTERVAL '10' SECOND) AS wStart,
       COUNT(id) AS tweet_cnt
FROM pulsar_tweets
GROUP BY TUMBLE(publishTime, INTERVAL '10' SECOND);
Continuous SQL Query
Demo: Marta Paes @morsapaes
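Because the sink is just another Pulsar topic, the aggregated counts can be consumed by any downstream Pulsar client, or tailed straight from the SQL client. A sketch:

```sql
-- Watch the 10-second tweet counts as the continuous query writes them.
SELECT tmstmp, tweet_cnt
FROM pulsar_tweets_agg;
```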
46. Demo > Tweet Count in Windows
Demo: Marta Paes @morsapaes
54. New Slack Space!
● Go-to space for user troubleshooting
● 800 members in less than 2 months
● Members include most of the Flink committers + PMC members
60. Resources
● Flink Ahead: What Comes After Batch & Streaming: https://youtu.be/h5OYmy9Yx7Y
● Apache Pulsar as one Storage System for Real Time & Historical Data Analysis: https://medium.com/streamnative/apache-pulsar-as-one-storage-455222c59017
● Flink Table API & SQL: https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/queries.html#operations
● Flink SQL Cookbook: https://github.com/ververica/flink-sql-cookbook
● When Flink & Pulsar Come Together:
https://flink.apache.org/2019/05/03/pulsar-flink.html
● How to Query Pulsar Streams in Flink: https://flink.apache.org/news/2019/11/25/query-pulsar-streams-using-apache-flink.html
● What’s New in the Flink/Pulsar Connector: https://flink.apache.org/2021/01/07/pulsar-flink-connector-270.html
● Marta’s Demo: https://github.com/morsapaes/flink-sql-pulsar