Despite what the Ghostbusters said, we’re going to go ahead and cross (or rather, join) the streams. This session covers getting started with streaming data pipelines, pairing Pulsar’s messaging system with one of the most flexible streaming frameworks available, Apache Flink. Specifically, we’ll demonstrate Flink SQL, which provides various abstractions and keeps your pipeline language-agnostic. So, if you want to leverage the power of a high-speed, highly customizable stream processing engine without the usual overhead and learning curves of the technologies involved (and their interconnected relationships), this talk is for you. Watch the step-by-step demo to build a unified batch and streaming pipeline from scratch with Pulsar, via the Flink SQL client. This means you don’t need to be familiar with Flink (or even a specific programming language). The examples provided are built for highly complex systems, but the talk itself is accessible to any experience level.
9. Stream Processing > The Challenges
@CAITO_200_OK
● You can’t pause to fix it
● Lots of data, FAST
● Ingesting multiple formats
● Failure recovery
● Needs to scale
12. Flink > Basics
Building Blocks (events, state, (event) time)
DataStream API (streams, windows)
Table API (dynamic tables)
Flink SQL
PyFlink
(Diagram: the APIs are layered from stateful stream processing at the bottom to streaming analytics & ML at the top, trading expressiveness for ease of use as you move up the stack.)
13. Flink > Summary
Flexible APIs
● Ease of use/Expressiveness
● Wide Range of Use Cases
High Performance
● Local State Access
● High Throughput/Low Latency
Stateful Processing
● State = First-class Citizen
● Event-time Support
Fault Tolerance
● Distributed State Snapshots
● Exactly-once Guarantees
14. Flink SQL
● Stream processing = processing data in real time
● Stream processing is complex
● Flink is a highly performant streaming engine
● Flink solves many of streaming’s hard problems
● Flink is complex
● Flink SQL gives you access to Flink’s benefits
● …and abstracts away the complexity
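One concrete way Flink SQL hides that complexity: the same query text can run over bounded (batch) or unbounded (streaming) input. In recent Flink versions, switching is a single setting in the SQL client. A minimal sketch, reusing the `clicks` example from later slides:

```sql
-- Run the query over a bounded snapshot of the input (batch mode)...
SET 'execution.runtime-mode' = 'batch';

-- ...or over the live, unbounded stream (streaming mode, the default).
SET 'execution.runtime-mode' = 'streaming';

-- The query itself is identical in both modes.
SELECT user_id, COUNT(url) AS cnt
FROM clicks
GROUP BY user_id;
```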
21. Flink SQL Demo
● Making the complex simple
● You could start a data pipeline anywhere!
● Language agnostic
From: Free Guy movie
22. Flink SQL Demo > Regular SQL
SELECT user_id,
       COUNT(url) AS cnt
FROM clicks
GROUP BY user_id;

Input (clicks):
user | cTime    | url
Mary | 12:00:00 | https://…
Bob  | 12:00:00 | https://…
Mary | 12:00:02 | https://…
Liz  | 12:00:03 | https://…

Result:
user | cnt
Mary | 2
Bob  | 1

A regular SQL query takes a snapshot when it starts, produces a final result, and terminates. A row that was added after the query started (here, Liz’s click) is not considered.
Image: Marta Paes @morsapaes
23. Flink SQL Demo > Flink SQL
SELECT user_id,
       COUNT(url) AS cnt
FROM clicks
GROUP BY user_id;

Input (clicks):
user | cTime    | url
Mary | 12:00:00 | https://…
Bob  | 12:00:00 | https://…
Mary | 12:00:02 | https://…
Liz  | 12:00:03 | https://…

Continuously updated result:
user | cnt
Mary | 1 → 2
Bob  | 1
Liz  | 1

Flink SQL ingests all changes as they happen and continuously updates the result. The result is identical to the one-time query (at this point).
Image: Marta Paes @morsapaes
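You can watch these continuous updates directly in the Flink SQL client. A sketch, assuming a recent Flink version where the `sql-client.execution.result-mode` option is available:

```sql
-- Show the result as a changelog: inserts (+I), retractions (-U), updates (+U).
SET 'sql-client.execution.result-mode' = 'changelog';

SELECT user_id, COUNT(url) AS cnt
FROM clicks
GROUP BY user_id;

-- When Mary's second click arrives, the client emits something like:
--   -U (Mary, 1)
--   +U (Mary, 2)
```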
40. Demo > Twitter Firehose
Demo: Marta Paes @morsapaes
41. Demo > Twitter Firehose
Demo: Marta Paes @morsapaes
42. Demo > Twitter Firehose
CREATE CATALOG pulsar WITH (
'type' = 'pulsar',
'service-url' = 'pulsar://pulsar:6650',
'admin-url' = 'http://pulsar:8080',
'format' = 'json'
);
Catalog DDL
Demo: Marta Paes @morsapaes
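Once the catalog is registered, existing Pulsar topics are exposed as tables without any per-topic DDL. A sketch of exploring it from the SQL client (the `tweets` topic/table name follows the demo; exact table naming depends on the connector version):

```sql
USE CATALOG pulsar;

-- Each Pulsar topic in the namespace appears as a table.
SHOW TABLES;

-- Query a topic directly; the JSON payload is mapped to columns.
SELECT * FROM tweets;
```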
43. Demo > Twitter Firehose
Not cool. 👹
Demo: Marta Paes @morsapaes
44. Demo > Get Relevant Timestamps
CREATE TABLE pulsar_tweets (
  publishTime TIMESTAMP(3) METADATA,
  WATERMARK FOR publishTime AS publishTime - INTERVAL '5' SECOND
) WITH (
  'connector' = 'pulsar',
  'topic' = 'persistent://public/default/tweets',
  'value.format' = 'json',
  'service-url' = 'pulsar://pulsar:6650',
  'admin-url' = 'http://pulsar:8080',
  'scan.startup.mode' = 'earliest-offset'
)
LIKE tweets;
Derive schema from the original topic
Define the source connector (Pulsar)
Read and use Pulsar message metadata
Demo: Marta Paes @morsapaes
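To verify the metadata column and watermark were picked up, you can inspect the new table before wiring it into a pipeline. A sketch (`id` is assumed to be one of the columns inherited from `tweets`):

```sql
-- Shows the inherited columns plus publishTime (METADATA) and the watermark spec.
DESCRIBE pulsar_tweets;

-- Spot-check that event timestamps are populated from Pulsar's publish time.
SELECT id, publishTime
FROM pulsar_tweets;
```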
45. Demo > Windowed Aggregation
CREATE TABLE pulsar_tweets_agg (
  tmstmp TIMESTAMP(3),
  tweet_cnt BIGINT
) WITH (
  'connector' = 'pulsar',
  'topic' = 'persistent://public/default/tweets_agg',
  'value.format' = 'json',
  'service-url' = 'pulsar://pulsar:6650',
  'admin-url' = 'http://pulsar:8080'
);
Sink Table DDL

INSERT INTO pulsar_tweets_agg
SELECT TUMBLE_START(publishTime, INTERVAL '10' SECOND) AS wStart,
       COUNT(id) AS tweet_cnt
FROM pulsar_tweets
GROUP BY TUMBLE(publishTime, INTERVAL '10' SECOND);
Continuous SQL Query
Demo: Marta Paes @morsapaes
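Because the sink is just another Pulsar topic, the aggregated counts can be consumed by any downstream Pulsar client, or tailed straight from the SQL client. A sketch:

```sql
-- Watch the 10-second tweet counts as the continuous query writes them.
SELECT tmstmp, tweet_cnt
FROM pulsar_tweets_agg;
```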
46. Demo > Tweet Count in Windows
Demo: Marta Paes @morsapaes
54. New Slack Space!
● Go-to space for user troubleshooting
● 800 members in less than 2 months
● Members include most of the Flink committers + PMC members
60. Resources
● Flink Ahead: What Comes After Batch & Streaming: https://youtu.be/h5OYmy9Yx7Y
● Apache Pulsar as one Storage System for Real Time & Historical Data Analysis: https://medium.com/streamnative/apache-pulsar-as-one-storage-455222c59017
● Flink Table API & SQL: https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/queries.html#operations
● Flink SQL Cookbook: https://github.com/ververica/flink-sql-cookbook
● When Flink & Pulsar Come Together:
https://flink.apache.org/2019/05/03/pulsar-flink.html
● How to Query Pulsar Streams in Flink: https://flink.apache.org/news/2019/11/25/query-pulsar-streams-using-apache-flink.html
● What’s New in the Flink/Pulsar Connector: https://flink.apache.org/2021/01/07/pulsar-flink-connector-270.html
● Marta’s Demo: https://github.com/morsapaes/flink-sql-pulsar