CDC Stream Processing with Apache Flink
Timo Walther
@twalthr
Current 2022
2022-10-05
About me
Open source
● Long-term committer since 2014 (before ASF)
● Member of the project management committee (PMC)
● Top 5 contributor (commits), top 1 contributor (additions)
● Among core architects of Flink SQL
Career
● Early Software Engineer @ DataArtisans
● SDK Team @ DataArtisans/Ververica (acquisition by Alibaba)
● SQL Team Lead @ Ververica
● Co-Founder @ Immerok
Visit us at booth S14!
What is Apache Flink?
Building Blocks for Stream Processing
Time
● Synchronize
● Progress
● Wait
● Timeout
● Fast-forward
● Replay
State
● Store
● Buffer
● Cache
● Model
● Grow
● Expire
Streams
● Pipeline
● Distribute
● Join
● Enrich
● Control
● Replay
Snapshots
● Backup
● Version
● Fork
● A/B test
● Time-travel
● Restore
What makes Apache Flink unique?
Figure: a dataflow of Source 1 and Source 2 with Normalize, Filter, Join, and Sink operators. The sources read shards and partitions, and the operators are split into subtasks. Annotations: fast local state that scales with the operator; long-term durable storage.
What is Apache Flink used for?
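Figure: Flink's use cases (Analytics, Event-driven Applications, Data Integration / ETL) applied to data such as Transactions, Logs, IoT, Interactions, Events, …, connected on one side to Messaging Systems, Files, Databases, and Key/Value Stores, and on the other side to Applications, Messaging Systems, Files, Databases, and Key/Value Stores.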
Apache Flink’s APIs
API Stack
Dataflow Runtime
Low-Level Stream Operator API
Optimizer / Planner
Table / SQL API
DataStream API Stateful Functions
DataStream API
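// Needs org.apache.flink.api.common.RuntimeExecutionMode plus the DataStream API imports.
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
DataStream<Integer> stream = env.fromElements(1, 2, 3);
// Runs the job and prints every collected element.
stream.executeAndCollect().forEachRemaining(System.out::println);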
Properties
● Exposes the building blocks for stream processing
● Arbitrary operator topologies using map(), process(), connect(), ...
● Business logic is written in user-defined functions
● Arbitrary user-defined record types flow in-between
● Conceptually always an append-only / insert-only log!
Output
1
2
3
Table / SQL API
TableEnvironment env =
    TableEnvironment.create(EnvironmentSettings.inStreamingMode());
// Programmatic; row() is a static import from org.apache.flink.table.api.Expressions
Table table = env.fromValues(row(1), row(2), row(3));
// SQL
Table sqlTable = env.sqlQuery("SELECT * FROM (VALUES (1), (2), (3))");
table.execute().print();
Properties
● Abstracts the building blocks for stream processing
● Operator topology is determined by planner
● Business logic is declared in SQL and/or Table API
● Internal record types flow, Flink’s Row type is exposed in Table API
● Conceptually a table, but a changelog under the hood!
Output
+----+-------------+
| op |          f0 |
+----+-------------+
| +I |           1 |
| +I |           2 |
| +I |           3 |
+----+-------------+
DataStream API ↔ Table / SQL API
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
// Stream -> Table
DataStream<?> inStream1 = ...
Table appendOnlyTable = tableEnv.fromDataStream(inStream1);
DataStream<Row> inStream2 = ...
Table anyTable = tableEnv.fromChangelogStream(inStream2);
// Table -> Stream
DataStream<T> appendOnlyStream = tableEnv.toDataStream(appendOnlyTable, T.class);
DataStream<Row> changelogStream = tableEnv.toChangelogStream(anyTable);
Mix and match APIs!
Changelog Stream Processing
Data Processing is a Stream of Changes
● Business data is always a stream: bounded or unbounded
● Every record is a changelog entry: insertion as the default
● Batch processing is just a special case in the runtime
Figure: a timeline from past through now to future, showing a bounded stream (from start to end of stream) and unbounded streams.
How do I Work with Streams in Flink SQL?
● You don’t. You work with dynamic tables!
● A concept similar to materialized views
CREATE TABLE Transactions
(name STRING, amount INT)
WITH (…);

name   amount
Alice  56
Bob    10
Alice  89

INSERT INTO Revenue
SELECT name, SUM(amount)
FROM Transactions
GROUP BY name;

CREATE TABLE Revenue
(name STRING, total INT)
WITH (…);

name   total
Alice  145
Bob    10
So, is Flink SQL a database? No, bring your own data and systems!
Stream-Table Duality - Basics
● A stream is the changelog of a dynamic table
● Sources, operators, and sinks work on changelogs under the hood
● Each component declares the kind of changes it consumes/produces
Changelog         Mode
only +I           Appending / Insert-only
contains -…       Updating
contains -U       Retracting
never -U but +U   Upserting

Short name  Long name      Description
+I          Insertion      Default for scans + output of bounded results.
-U          Update Before  Retracts a previously emitted result.
+U          Update After   Updates a previously emitted result. Requires a primary key if -U is omitted for idempotent updates.
-D          Delete         Removes the last result.
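The mode names follow mechanically from the kinds of changes a component declares. A minimal Java sketch of that classification, assuming a hypothetical `ChangeKind` enum and `classify` helper (illustrative names only, not Flink's API):

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration of the mode table; Flink's own types are not used here.
final class ChangelogModes {
    // The four change kinds: +I, -U, +U, -D.
    enum ChangeKind { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }

    // Derives a mode name from the set of change kinds a component produces.
    static String classify(Set<ChangeKind> kinds) {
        boolean insertOnly = kinds.stream().allMatch(k -> k == ChangeKind.INSERT);
        if (insertOnly) {
            return "append-only";   // only +I
        }
        if (kinds.contains(ChangeKind.UPDATE_BEFORE)) {
            return "retracting";    // contains -U
        }
        if (kinds.contains(ChangeKind.UPDATE_AFTER)) {
            return "upserting";     // never -U but +U
        }
        return "updating";          // some other non-insert change, e.g. only -D
    }

    public static void main(String[] args) {
        System.out.println(classify(EnumSet.of(ChangeKind.INSERT)));
        System.out.println(classify(EnumSet.allOf(ChangeKind.class)));
        System.out.println(classify(EnumSet.of(
                ChangeKind.INSERT, ChangeKind.UPDATE_AFTER, ChangeKind.DELETE)));
    }
}
```

Retracting and upserting are both special cases of updating; the sketch returns the most specific name that applies.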
Stream-Table Duality - Example
An applied changelog becomes a real (materialized) table.

CREATE TABLE Transactions
(name STRING, amount INT)
WITH (…);

name   amount
Alice  56
Bob    10
Alice  89

changelog (in order of arrival): +I[Alice, 56] +I[Bob, 10] +I[Alice, 89]

INSERT INTO Revenue
SELECT name, SUM(amount)
FROM Transactions
GROUP BY name;

changelog (in order of arrival): +I[Alice, 56] +I[Bob, 10] -U[Alice, 56] +U[Alice, 145]

CREATE TABLE Revenue
(name STRING, total INT)
WITH (…);

materialization
name   total
Alice  145
Bob    10
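Materialization can be sketched as folding the changelog into a keyed map. The Java below is a minimal illustration, assuming `name` acts as the key of the result and using a hypothetical `Materializer` class (not Flink code):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of applying a changelog; not part of Flink.
final class Materializer {
    // Applies changelog entries of the form {op, name, total} to a table keyed by name.
    static Map<String, Integer> materialize(String[][] changelog) {
        Map<String, Integer> table = new TreeMap<>();
        for (String[] change : changelog) {
            String op = change[0];
            String name = change[1];
            int total = Integer.parseInt(change[2]);
            switch (op) {
                case "+I":
                case "+U":
                    table.put(name, total);     // insert or overwrite the row for this key
                    break;
                case "-U":
                case "-D":
                    table.remove(name, total);  // take back the previously emitted row
                    break;
                default:
                    throw new IllegalArgumentException("Unknown op: " + op);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        String[][] changelog = {
            {"+I", "Alice", "56"},
            {"+I", "Bob", "10"},
            {"-U", "Alice", "56"},
            {"+U", "Alice", "145"},
        };
        System.out.println(materialize(changelog)); // {Alice=145, Bob=10}
    }
}
```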
Stream-Table Duality - Example
An applied changelog becomes a real (materialized) table.

CREATE TABLE Transactions
(name STRING, amount INT)
WITH (…);

name   amount
Alice  56
Bob    10
Alice  89

changelog (in order of arrival): +I[Alice, 56] +I[Bob, 10] +I[Alice, 89]

INSERT INTO Revenue
SELECT name, SUM(amount)
FROM Transactions
GROUP BY name;

changelog (in order of arrival): +I[Alice, 56] +I[Bob, 10] -U[Alice, 56] +U[Alice, 145]
With a primary key on the sink, the -U[Alice, 56] can be omitted.

CREATE TABLE Revenue
(PRIMARY KEY(name) …)
WITH (…);

materialization
name   total
Alice  145
Bob    10
Save ~50% of traffic if downstream system supports upserting!
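Where the saving comes from can be seen by emitting the same aggregation result in both styles. A minimal Java sketch, assuming a hypothetical `SumChangelog.emit` helper that mimics the SUM query over the example data (illustrative only, not Flink's operator):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; mimics SELECT name, SUM(amount) ... GROUP BY name.
final class SumChangelog {
    // With upsert == false every update is a -U/+U pair; with upsert == true only +U is sent.
    static List<String> emit(String[] names, int[] amounts, boolean upsert) {
        Map<String, Integer> totals = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            Integer previous = totals.get(names[i]);
            int current = (previous == null ? 0 : previous) + amounts[i];
            totals.put(names[i], current);
            if (previous == null) {
                out.add("+I[" + names[i] + ", " + current + "]");
            } else {
                if (!upsert) {
                    out.add("-U[" + names[i] + ", " + previous + "]");
                }
                out.add("+U[" + names[i] + ", " + current + "]");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] names = {"Alice", "Bob", "Alice"};
        int[] amounts = {56, 10, 89};
        System.out.println(emit(names, amounts, false));
        // [+I[Alice, 56], +I[Bob, 10], -U[Alice, 56], +U[Alice, 145]]
        System.out.println(emit(names, amounts, true));
        // [+I[Alice, 56], +I[Bob, 10], +U[Alice, 145]]
    }
}
```

Every update costs two messages in retract style and one in upsert style, so a workload dominated by updates approaches half the traffic.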
Stream-Table Duality - Propagation
● Source declares the set of emitted changes, i.e. its changelog mode
● Optimizer tracks changelog mode and primary key through pipeline
● Sink declares changes it can digest

CREATE TABLE …                                               Changelog mode
… WITH ('connector'='filesystem')                            +I
… WITH ('connector'='kafka')                                 +I
… WITH ('connector'='upsert-kafka')                          +I -D
… WITH ('connector'='jdbc')                                  +I (for sources)
… WITH ('connector'='kafka', 'format' = 'debezium-json')     +I -U +U -D
Retract vs. Upsert
Retract
● No primary key requirements
● Works for almost every external system
● Supports duplicate rows
● Often unavoidable in distributed systems
→ most flexible changelog mode
→ default mode
Upsert
● Traffic + computation optimization
● In-place updates (idempotency)

SELECT c, COUNT(*) FROM (
  SELECT COUNT(*) AS c
  FROM T
  GROUP BY `user`
)
GROUP BY c

Figure: two chained operators, Count 1 and Count 2, each with Subtask 1 and Subtask 2. Count 1 receives +I[…] and emits +U[1] and +U[2]; Count 2 then holds 1=>1 and 2=>1.
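One reading of the count-of-counts example is that the outer aggregation can only stay correct if it is told which earlier result to take back. A minimal Java sketch of the outer `GROUP BY c`, assuming a hypothetical `CountOfCounts` class and a single user whose inner count goes from 1 to 2 (illustrative only):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the outer aggregation; not Flink code.
final class CountOfCounts {
    // Consumes the inner changelog as {op, c} pairs.
    // A -U removes one row from bucket c; +I and +U add one row to bucket c.
    static Map<Integer, Integer> outerCounts(String[][] innerChangelog) {
        Map<Integer, Integer> buckets = new TreeMap<>();
        for (String[] change : innerChangelog) {
            int c = Integer.parseInt(change[1]);
            int delta = change[0].startsWith("-") ? -1 : 1;
            int updated = buckets.getOrDefault(c, 0) + delta;
            if (updated == 0) {
                buckets.remove(c);   // no rows left with this count
            } else {
                buckets.put(c, updated);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        String[][] retract = {{"+I", "1"}, {"-U", "1"}, {"+U", "2"}};
        String[][] withoutRetraction = {{"+I", "1"}, {"+U", "2"}};
        System.out.println(outerCounts(retract));           // {2=1}
        System.out.println(outerCounts(withoutRetraction)); // {1=1, 2=1}
    }
}
```

Without the -U, the stale bucket for count 1 is never decremented, which matches the 1=>1, 2=>1 state shown in the figure.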
Changelog Insights – Append-only
CREATE TABLE Transaction (tid BIGINT, amount INT);
CREATE TABLE Payment (tid BIGINT, method STRING);
CREATE TABLE Result (tid BIGINT, …); -- accepts all changes
INSERT INTO Result SELECT * FROM Transaction T JOIN Payment P ON T.tid = P.tid;

Sink(table=[Result], changelogMode=[NONE])
+- Join(leftInputSpec=[NoUniqueKey], rightInputSpec=[NoUniqueKey], changelogMode=[I])
   :- Exchange(changelogMode=[I])
   :  +- TableSourceScan(table=[[Transaction]], changelogMode=[I])
   +- Exchange(changelogMode=[I])
      +- TableSourceScan(table=[[Payment]], changelogMode=[I])
Changelog Insights – Updating
CREATE TABLE Transaction (tid BIGINT, amount INT);
CREATE TABLE Payment (tid BIGINT, method STRING);
CREATE TABLE Result (tid BIGINT, …);
INSERT INTO Result SELECT * FROM Transaction T LEFT JOIN Payment P ON T.tid = P.tid;

Sink(table=[Result], changelogMode=[NONE])
+- Join(leftInputSpec=[NoUniqueKey], rightInputSpec=[NoUniqueKey], changelogMode=[I,UB,UA,D])
   :- Exchange(changelogMode=[I])
   :  +- TableSourceScan(table=[[Transaction]], changelogMode=[I])
   +- Exchange(changelogMode=[I])
      +- TableSourceScan(table=[[Payment]], changelogMode=[I])
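The jump from an append-only inner join to an updating outer join can be understood as the outer join having to emit a null-padded row before it knows whether a match will arrive. A simplified Java sketch, assuming a hypothetical in-memory `LeftJoinSketch`; it models the correction as a -D followed by a +I, which is one possible encoding and not necessarily the exact sequence Flink's join operator emits:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of Transaction LEFT JOIN Payment ON tid; not Flink code.
final class LeftJoinSketch {
    private final Map<Long, List<Integer>> transactions = new HashMap<>(); // tid -> amounts
    private final Map<Long, List<String>> payments = new HashMap<>();      // tid -> methods
    final List<String> out = new ArrayList<>();                            // emitted changelog

    void onTransaction(long tid, int amount) {
        transactions.computeIfAbsent(tid, k -> new ArrayList<>()).add(amount);
        List<String> methods = payments.getOrDefault(tid, Collections.emptyList());
        if (methods.isEmpty()) {
            // No payment yet: the outer join still emits the left row, padded with null.
            out.add("+I[" + tid + ", " + amount + ", null]");
        }
        for (String method : methods) {
            out.add("+I[" + tid + ", " + amount + ", " + method + "]");
        }
    }

    void onPayment(long tid, String method) {
        List<String> methods = payments.computeIfAbsent(tid, k -> new ArrayList<>());
        boolean firstPayment = methods.isEmpty();
        methods.add(method);
        for (int amount : transactions.getOrDefault(tid, Collections.emptyList())) {
            if (firstPayment) {
                // The null-padded row is now wrong and has to be taken back.
                out.add("-D[" + tid + ", " + amount + ", null]");
            }
            out.add("+I[" + tid + ", " + amount + ", " + method + "]");
        }
    }

    public static void main(String[] args) {
        LeftJoinSketch join = new LeftJoinSketch();
        join.onTransaction(1L, 56);
        join.onPayment(1L, "card");
        System.out.println(join.out);
        // [+I[1, 56, null], -D[1, 56, null], +I[1, 56, card]]
    }
}
```

An inner join never emits the null-padded row, so it never has anything to take back and stays append-only.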
Changelog Insights – Updating with PK
CREATE TABLE Transaction (tid BIGINT, amount INT);
CREATE TABLE Payment (tid BIGINT, method STRING);
CREATE TABLE Result (tid BIGINT, …, PRIMARY KEY(tid) NOT ENFORCED);
INSERT INTO Result SELECT * FROM Transaction T LEFT JOIN Payment P ON T.tid = P.tid;

Sink(table=[Result], changelogMode=[NONE], upsertMaterialize=[true])
+- Join(leftInputSpec=[NoUniqueKey], rightInputSpec=[NoUniqueKey], changelogMode=[I,UB,UA,D])
   :- Exchange(changelogMode=[I])
   :  +- TableSourceScan(table=[[Transaction]], changelogMode=[I])
   +- Exchange(changelogMode=[I])
      +- TableSourceScan(table=[[Payment]], changelogMode=[I])
Changelog Insights – Updating with PK
CREATE TABLE Transaction (tid BIGINT, …, PRIMARY KEY(tid) NOT ENFORCED);
CREATE TABLE Payment (tid BIGINT, …, PRIMARY KEY(tid) NOT ENFORCED);
CREATE TABLE Result (tid BIGINT, …, PRIMARY KEY(tid) NOT ENFORCED);
INSERT INTO Result SELECT * FROM Transaction T LEFT JOIN Payment P ON T.tid = P.tid;

Sink(table=[Result], changelogMode=[NONE])
+- Join(leftInputSpec=[UniqueKey], rightInputSpec=[UniqueKey], changelogMode=[I,UA,D])
   :- Exchange(changelogMode=[I])
   :  +- TableSourceScan(table=[[Transaction]], changelogMode=[I])
   +- Exchange(changelogMode=[I])
      +- TableSourceScan(table=[[Payment]], changelogMode=[I])
Mode Transitions
Figure: transitions between the Append-only, Retracting, and Updating modes. Labels: through operation; ChangelogNormalize (if operator/sink requires it); UpsertMaterialize (if sink requires it).
Mode Transitions – Characteristics
Append-only
● Event-time column backed by watermarks
● Highly state efficient due to notion of completeness
● Usually no event-time column
● State usage needs to be kept in mind
● Pure materialized view maintenance
Retracting
Updating
aka "TABLE"
aka "STREAM"
aka ?
Mode Transitions – Joins
Left input    Right input   Join                 Result
Append-only   Append-only   regular join         Append-only
Append-only   Updating      regular join         Updating
Append-only   Append-only   regular outer join   Updating
Append-only   Updating      temporal join        Append-only
Mode Transitions – Temporal Join
SELECT
order_id,
price,
currency,
conversion_rate,
order_time
FROM Orders
LEFT JOIN CurrencyRates FOR SYSTEM_TIME AS OF Orders.order_time
ON Orders.currency = CurrencyRates.currency;
CREATE TABLE CurrencyRates (
WATERMARK FOR update_time AS …, PRIMARY KEY(currency) NOT ENFORCED,…);
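The idea behind `FOR SYSTEM_TIME AS OF` can be sketched as a lookup into the version history of a key. The Java below is a minimal illustration, assuming a hypothetical `VersionedRates` class that keeps every version per currency in a sorted map; Flink's actual operator, its state layout, and its watermark handling are not modeled:

```java
import java.time.LocalTime;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a versioned table lookup; not Flink code.
final class VersionedRates {
    // currency -> (update_time -> rate): every version of every key
    private final Map<String, TreeMap<LocalTime, Integer>> versions = new HashMap<>();

    void update(String currency, LocalTime updateTime, int rate) {
        versions.computeIfAbsent(currency, k -> new TreeMap<>()).put(updateTime, rate);
    }

    // Returns the rate that was valid at orderTime, or null if no version
    // existed yet (a LEFT JOIN would then pad the row with NULL).
    Integer rateAsOf(String currency, LocalTime orderTime) {
        TreeMap<LocalTime, Integer> history = versions.get(currency);
        if (history == null) {
            return null;
        }
        Map.Entry<LocalTime, Integer> version = history.floorEntry(orderTime);
        return version == null ? null : version.getValue();
    }

    public static void main(String[] args) {
        VersionedRates rates = new VersionedRates();
        rates.update("Euro", LocalTime.parse("09:00:00"), 114);
        rates.update("Euro", LocalTime.parse("11:15:00"), 119);
        System.out.println(rates.rateAsOf("Euro", LocalTime.parse("10:00:00"))); // 114
        System.out.println(rates.rateAsOf("Euro", LocalTime.parse("12:00:00"))); // 119
        System.out.println(rates.rateAsOf("Yen", LocalTime.parse("10:00:00")));  // null
    }
}
```

Each order is matched against exactly one version, so a later rate update does not change rows that were already joined; this is consistent with the temporal join producing an append-only result.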
Mode Transitions – Explicit Transition without PK
Append-only
op  update_time  currency  rate
==  ===========  ========  ====
+I  09:00:00     Yen       102
+I  09:00:00     Euro      114
+I  09:00:00     USD       1
+I  11:15:00     Euro      119
+I  11:49:00     Pounds    108

Updating
op  update_time  currency  rate
==  ===========  ========  ====
+I  09:00:00     Yen       102
+I  09:00:00     Euro      114
+I  09:00:00     USD       1
+U  11:15:00     Euro      119
+I  11:49:00     Pounds    108
Mode Transitions – Explicit Transition without PK
Append-only → Updating
CREATE VIEW versioned_rates AS
SELECT currency, rate, update_time
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY currency ORDER BY update_time DESC) AS rownum
  FROM currency_rates
)
WHERE rownum = 1;
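What the deduplication view does to the changelog can be sketched by keeping only the latest row per currency. A minimal Java illustration, assuming a hypothetical `LatestPerKey.toChangelog` helper that emits upsert-style changes (+I for a new key, +U for a newer version); a retracting variant would additionally emit a -U before each +U:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "latest row per key"; not Flink code.
final class LatestPerKey {
    // Turns append-only rows {update_time, currency, rate} into an updating
    // changelog keyed by currency, mirroring ROW_NUMBER() ... WHERE rownum = 1.
    static List<String> toChangelog(String[][] rows) {
        Map<String, String[]> latest = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            String[] previous = latest.get(row[1]);
            if (previous == null) {
                latest.put(row[1], row);
                out.add("+I " + String.join(" ", row));
            } else if (row[0].compareTo(previous[0]) >= 0) {
                // HH:mm:ss strings sort chronologically; the newer update_time wins.
                latest.put(row[1], row);
                out.add("+U " + String.join(" ", row));
            }
            // Rows older than the current version do not change the result.
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"09:00:00", "Yen", "102"},
            {"09:00:00", "Euro", "114"},
            {"09:00:00", "USD", "1"},
            {"11:15:00", "Euro", "119"},
            {"11:49:00", "Pounds", "108"},
        };
        toChangelog(rows).forEach(System.out::println);
    }
}
```

For the five example rows this yields four +I entries and one +U for the second Euro row.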
Demo
https://github.com/twalthr/flink-api-examples
Summary
TLDR
● Flink's SQL engine is a powerful changelog processor
● Flexible tool for integrating systems with different semantics
There is more…
● CDC connector ecosystem
→ 2.6k GitHub stars
https://flink-packages.org/packages/cdc-connectors
● Table Store
→ unified storage engine for dynamic tables
https://flink.apache.org/news/2022/05/11/release-table-store-0.1.0.html
● SQL Gateway
https://cwiki.apache.org/confluence/display/FLINK/FLIP-91%3A+Support+SQL+Gateway
Thanks
Timo Walther
@twalthr
mrsql@immerok.io

More Related Content

What's hot

Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
Flink Forward
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
GetInData
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
Alluxio, Inc.
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Flink Forward
 
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
HostedbyConfluent
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
Xiang Fu
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Flink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
Flink Forward
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
Chhavi Parasher
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
Saurav Haloi
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
Flink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
Flink Forward
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
Jiangjie Qin
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
Adam Kotwasinski
 

What's hot (20)

Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Extending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use casesExtending Flink SQL for stream processing use cases
Extending Flink SQL for stream processing use cases
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuVirtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Practical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobsPractical learnings from running thousands of Flink jobs
Practical learnings from running thousands of Flink jobs
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 

Similar to CDC Stream Processing with Apache Flink

Changelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkChangelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache Flink
Flink Forward
 
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
HostedbyConfluent
 
Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!
HostedbyConfluent
 
Why and how to leverage the power and simplicity of SQL on Apache Flink
Why and how to leverage the power and simplicity of SQL on Apache FlinkWhy and how to leverage the power and simplicity of SQL on Apache Flink
Why and how to leverage the power and simplicity of SQL on Apache Flink
Fabian Hueske
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
Flink Forward
 
20191116 custom operators in swift
20191116 custom operators in swift20191116 custom operators in swift
20191116 custom operators in swift
Chiwon Song
 
Fs2 - Crash Course
Fs2 - Crash CourseFs2 - Crash Course
Fs2 - Crash Course
Lukasz Byczynski
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with ClojureDmitry Buzdin
 
Flink Batch Processing and Iterations
Flink Batch Processing and IterationsFlink Batch Processing and Iterations
Flink Batch Processing and Iterations
Sameer Wadkar
 
Tableau + Redshift views for dummies
Tableau + Redshift views for dummiesTableau + Redshift views for dummies
Tableau + Redshift views for dummies
Ivan Magrans
 
Erlang/OTP in Riak
Erlang/OTP in RiakErlang/OTP in Riak
Erlang/OTP in Riak
Sargun Dhillon
 
MCRL2
MCRL2MCRL2
Materialized Views and Secondary Indexes in Scylla: They Are finally here!
Materialized Views and Secondary Indexes in Scylla: They Are finally here!Materialized Views and Secondary Indexes in Scylla: They Are finally here!
Materialized Views and Secondary Indexes in Scylla: They Are finally here!
ScyllaDB
 
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.com
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.comMcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.com
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.com
kashif kashif
 
Foundations of streaming SQL: stream & table theory
Foundations of streaming SQL: stream & table theoryFoundations of streaming SQL: stream & table theory
Foundations of streaming SQL: stream & table theory
DataWorks Summit
 
Job Queue in Golang
Job Queue in GolangJob Queue in Golang
Job Queue in Golang
Bo-Yi Wu
 
Writing MySQL User-defined Functions in JavaScript
Writing MySQL User-defined Functions in JavaScriptWriting MySQL User-defined Functions in JavaScript
Writing MySQL User-defined Functions in JavaScriptRoland Bouman
 
Ct es past_present_future_nycpgday_20130322
Ct es past_present_future_nycpgday_20130322Ct es past_present_future_nycpgday_20130322
Ct es past_present_future_nycpgday_20130322David Fetter
 
Functional programming in Swift
Functional programming in SwiftFunctional programming in Swift
Functional programming in Swift
John Pham
 

Similar to CDC Stream Processing with Apache Flink (20)

Changelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache FlinkChangelog Stream Processing with Apache Flink
Changelog Stream Processing with Apache Flink
 
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
 
Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!Flink's SQL Engine: Let's Open the Engine Room!
Flink's SQL Engine: Let's Open the Engine Room!
 
Pdxpugday2010 pg90
Pdxpugday2010 pg90Pdxpugday2010 pg90
Pdxpugday2010 pg90
 
Why and how to leverage the power and simplicity of SQL on Apache Flink
Why and how to leverage the power and simplicity of SQL on Apache FlinkWhy and how to leverage the power and simplicity of SQL on Apache Flink
Why and how to leverage the power and simplicity of SQL on Apache Flink
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
20191116 custom operators in swift
20191116 custom operators in swift20191116 custom operators in swift
20191116 custom operators in swift
 
Fs2 - Crash Course
Fs2 - Crash CourseFs2 - Crash Course
Fs2 - Crash Course
 
Refactoring to Macros with Clojure
Refactoring to Macros with ClojureRefactoring to Macros with Clojure
Refactoring to Macros with Clojure
 
Flink Batch Processing and Iterations
Flink Batch Processing and IterationsFlink Batch Processing and Iterations
Flink Batch Processing and Iterations
 
Tableau + Redshift views for dummies
Tableau + Redshift views for dummiesTableau + Redshift views for dummies
Tableau + Redshift views for dummies
 
Erlang/OTP in Riak
Erlang/OTP in RiakErlang/OTP in Riak
Erlang/OTP in Riak
 
MCRL2
MCRL2MCRL2
MCRL2
 
Materialized Views and Secondary Indexes in Scylla: They Are finally here!
Materialized Views and Secondary Indexes in Scylla: They Are finally here!Materialized Views and Secondary Indexes in Scylla: They Are finally here!
Materialized Views and Secondary Indexes in Scylla: They Are finally here!
 
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.com
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.comMcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.com
Mcrl2 by kashif.namal@gmail.com, adnanskyousafzai@gmail.com
 
Foundations of streaming SQL: stream & table theory
Foundations of streaming SQL: stream & table theoryFoundations of streaming SQL: stream & table theory
Foundations of streaming SQL: stream & table theory
 
Job Queue in Golang
Job Queue in GolangJob Queue in Golang
Job Queue in Golang
 
Writing MySQL User-defined Functions in JavaScript
Writing MySQL User-defined Functions in JavaScriptWriting MySQL User-defined Functions in JavaScript
Writing MySQL User-defined Functions in JavaScript
 
Ct es past_present_future_nycpgday_20130322
Ct es past_present_future_nycpgday_20130322Ct es past_present_future_nycpgday_20130322
Ct es past_present_future_nycpgday_20130322
 
Functional programming in Swift
Functional programming in SwiftFunctional programming in Swift
Functional programming in Swift
 

Recently uploaded

OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 
Graphic Design Crash Course for beginners
Graphic Design Crash Course for beginnersGraphic Design Crash Course for beginners
Graphic Design Crash Course for beginners
e20449
 


CDC Stream Processing with Apache Flink

  • 1. CDC Stream Processing with Apache Flink Timo Walther @twalthr – Current 2022 2022-10-05
  • 2. About me
    Open source
    ● Long-term committer since 2014 (before ASF)
    ● Member of the project management committee (PMC)
    ● Top 5 contributor (commits), top 1 contributor (additions)
    ● Among core architects of Flink SQL
    Career
    ● Early Software Engineer @ DataArtisans
    ● SDK Team @ DataArtisans/Ververica (acquisition by Alibaba)
    ● SQL Team Lead @ Ververica
    ● Co-Founder @ Immerok
    Visit us at booth S14!
  • 3. What is Apache Flink?
  • 4. Building Blocks for Stream Processing
    Time: Synchronize, Progress, Wait, Timeout, Fast-forward, Replay
    State: Store, Buffer, Cache, Model, Grow, Expire
    Streams: Pipeline, Distribute, Join, Enrich, Control, Replay
    Snapshots: Backup, Version, Fork, A/B test, Time-travel, Restore
  • 5. What makes Apache Flink unique?
    [Diagram: pipeline with the operators Source 1, Source 2, Normalize, Filter, Join, and Sink, running as parallel subtasks over shards and partitions. Labels: "fast local state that scales with the operator", "long-term durable storage".]
  • 6. What is Apache Flink used for?
    [Diagram: data (Transactions, Logs, IoT, Interactions, Events, …) flows from Messaging Systems, Files, Databases, and Key/Value Stores through Flink (Analytics, Event-driven Applications, Data Integration, ETL) into Applications, Messaging Systems, Files, Databases, and Key/Value Stores.]
  • 8. API Stack
    [Diagram: layered stack with the boxes Dataflow Runtime, Low-Level Stream Operator API, Optimizer / Planner, Table / SQL API, DataStream API, and Stateful Functions.]
  • 9. DataStream API
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(STREAMING);
        DataStream<Integer> stream = env.fromElements(1, 2, 3);
        stream.executeAndCollect().forEachRemaining(System.out::println);
    Output
        1
        2
        3
    Properties
    ● Exposes the building blocks for stream processing
    ● Arbitrary operator topologies using map(), process(), connect(), ...
    ● Business logic is written in user-defined functions
    ● Arbitrary user-defined record types flow in-between
    ● Conceptually always an append-only / insert-only log!
  • 10. Table / SQL API
        TableEnvironment env = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        // Programmatic
        Table table = env.fromValues(row(1), row(2), row(3));
        // SQL
        Table table = env.sqlQuery("SELECT * FROM (VALUES (1), (2), (3))");
        table.execute().print();
    Output
        +----+-------------+
        | op |          f0 |
        +----+-------------+
        | +I |           1 |
        | +I |           2 |
        | +I |           3 |
    Properties
    ● Abstracts the building blocks for stream processing
    ● Operator topology is determined by planner
    ● Business logic is declared in SQL and/or Table API
    ● Internal record types flow, Flink’s Row type is exposed in Table API
    ● Conceptually a table, but a changelog under the hood!
  • 11. DataStream API ↔ Table / SQL API
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
        // Stream -> Table
        DataStream<?> inStream1 = ...
        Table appendOnlyTable = tableEnv.fromDataStream(inStream1);
        DataStream<Row> inStream2 = ...
        Table anyTable = tableEnv.fromChangelogStream(inStream2);
        // Table -> Stream
        DataStream<T> appendOnlyStream = tableEnv.toDataStream(appendOnlyTable, T.class);
        DataStream<Row> changelogStream = tableEnv.toChangelogStream(anyTable);
    Mix and match APIs!
  • 13. Data Processing is a Stream of Changes
    ● Business data is always a stream: bounded or unbounded
    ● Every record is a changelog entry: insertion as the default
    ● Batch processing is just a special case in the runtime
    [Diagram: timeline (past, now, future) with a bounded stream running from "start" to "end of stream" and unbounded streams.]
  • 14. How do I Work with Streams in Flink SQL?
    ● You don’t. You work with dynamic tables!
    ● A concept similar to materialized views
        CREATE TABLE Transactions (name STRING, amount INT) WITH (…)
        CREATE TABLE Revenue (name STRING, total INT) WITH (…)
        INSERT INTO Revenue SELECT name, SUM(amount) FROM Transactions GROUP BY name
    Transactions
        name   amount
        Alice  56
        Bob    10
        Alice  89
    Revenue
        name   total
        Alice  145
        Bob    10
    So, is Flink SQL a database? No, bring your own data and systems!
  • 15. Stream-Table Duality - Basics
    ● A stream is the changelog of a dynamic table
    ● Sources, operators, and sinks work on changelogs under the hood
    ● Each component declares the kind of changes it consumes/produces
    Changelog modes
        only +I          Appending/Insert-only
        contains -…      Updating
        contains -U      Retracting
        never -U but +U  Upserting
    Change kinds
        Short name  Long name      Description
        +I          Insertion      Default for scans + output of bounded results.
        -U          Update Before  Retracts a previously emitted result.
        +U          Update After   Updates a previously emitted result. Requires a primary key if -U is omitted for idempotent updates.
        -D          Delete         Removes the last result.
  • 16. Stream-Table Duality - Example
    An applied changelog becomes a real (materialized) table.
        CREATE TABLE Transactions (name STRING, amount INT) WITH (…)
        CREATE TABLE Revenue (name STRING, total INT) WITH (…)
        INSERT INTO Revenue SELECT name, SUM(amount) FROM Transactions GROUP BY name
    Transactions
        name   amount
        Alice  56
        Bob    10
        Alice  89
    Transactions changelog (oldest first): +I[Alice, 56], +I[Bob, 10], +I[Alice, 89]
    Revenue changelog (oldest first): +I[Alice, 56], +I[Bob, 10], -U[Alice, 56], +U[Alice, 145]
    Revenue (materialization)
        name   total
        Alice  56 → 145
        Bob    10
  • 17. Stream-Table Duality - Example
    An applied changelog becomes a real (materialized) table.
        CREATE TABLE Transactions (name STRING, amount INT) WITH (…)
        CREATE TABLE Revenue (PRIMARY KEY(name) …) WITH (…)
        INSERT INTO Revenue SELECT name, SUM(amount) FROM Transactions GROUP BY name
    Transactions
        name   amount
        Alice  56
        Bob    10
        Alice  89
    Transactions changelog (oldest first): +I[Alice, 56], +I[Bob, 10], +I[Alice, 89]
    Revenue changelog (oldest first): +I[Alice, 56], +I[Bob, 10], -U[Alice, 56], +U[Alice, 145]
    With a primary key on Revenue, the -U[Alice, 56] entry can be omitted (upserting).
    Revenue (materialization)
        name   total
        Alice  56 → 145
        Bob    10
    Save ~50% of traffic if downstream system supports upserting!
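The two example slides can be replayed in plain Java. This is a minimal sketch, not Flink code: `ChangelogMaterializer`, `Change`, and `Kind` are hypothetical names invented for illustration, and the table is assumed to be keyed by `name`. It applies the retract-style changelog (four messages) and the upsert-style changelog (three messages, no -U) and arrives at the same materialized table.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; none of these types are Flink classes.
final class ChangelogMaterializer {

    enum Kind { INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE }

    static final class Change {
        final Kind kind;
        final String name;
        final int total;

        Change(Kind kind, String name, int total) {
            this.kind = kind;
            this.name = name;
            this.total = total;
        }
    }

    // Revenue changelog in retract mode: the update is a -U/+U pair.
    static List<Change> retractLog() {
        return Arrays.asList(
            new Change(Kind.INSERT, "Alice", 56),
            new Change(Kind.INSERT, "Bob", 10),
            new Change(Kind.UPDATE_BEFORE, "Alice", 56),
            new Change(Kind.UPDATE_AFTER, "Alice", 145));
    }

    // Revenue changelog in upsert mode: the -U message is omitted.
    static List<Change> upsertLog() {
        return Arrays.asList(
            new Change(Kind.INSERT, "Alice", 56),
            new Change(Kind.INSERT, "Bob", 10),
            new Change(Kind.UPDATE_AFTER, "Alice", 145));
    }

    // Applies the changes in order to a table keyed by name (the assumed primary key).
    static Map<String, Integer> materialize(List<Change> changelog) {
        Map<String, Integer> table = new TreeMap<>();
        for (Change change : changelog) {
            switch (change.kind) {
                case INSERT:
                case UPDATE_AFTER:
                    table.put(change.name, change.total);
                    break;
                case UPDATE_BEFORE:
                case DELETE:
                    table.remove(change.name);
                    break;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println(materialize(retractLog()));
        System.out.println(materialize(upsertLog()));
    }
}
```

Both calls print `{Alice=145, Bob=10}`. Each update costs two messages in retract mode and one in upsert mode, which is where the roughly 50% saving for update traffic comes from.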
  • 18. Stream-Table Duality - Propagation
    ● Source declares set of emitted changes, i.e. changelog mode
    ● Optimizer tracks changelog mode and primary key through pipeline
    ● Sink declares changes it can digest
    CREATE TABLE … with one of the following connectors:
        … WITH ('connector'='filesystem')
        … WITH ('connector'='kafka')
        … WITH ('connector'='upsert-kafka')
        … WITH ('connector'='jdbc')
        … WITH ('connector'='kafka', 'format' = 'debezium-json')
    Changelog modes shown beside the connectors: +I +I +I -D +I -U +U -D +I (for sources)
  • 19. Retract vs. Upsert
    Retract
    ● No primary key requirements
    ● Works for almost every external system
    ● Supports duplicate rows
    ● In distributed systems often unavoidable
    → most flexible changelog mode
    → default mode
    Upsert
    ● Traffic + computation optimization
    ● In-place updates (idempotency)
        SELECT c, COUNT(*) FROM (
          SELECT COUNT(*) AS c FROM T GROUP BY user
        ) GROUP BY c
    [Diagram: operators Count 1 and Count 2, each with Subtask 1 and Subtask 2; records +I[…], +U[1], +U[2]; state 1=>1, 2=>1.]
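The nested count query shows why retraction is often unavoidable: the outer aggregation groups by the count value, not by user, so an update from 1 to 2 has to remove the old contribution from bucket 1. The following plain-Java sketch simulates that with and without -U messages. `CountOfCounts` and the sample user names are hypothetical, and there is no Flink dependency.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of the nested count query; not Flink code.
final class CountOfCounts {

    // Returns "count value -> number of users with that count".
    // With retract == false the inner aggregation sends only +U, never -U.
    static Map<Long, Long> run(List<String> users, boolean retract) {
        Map<String, Long> perUser = new HashMap<>();   // inner COUNT(*) ... GROUP BY user
        Map<Long, Long> histogram = new TreeMap<>();   // outer COUNT(*) ... GROUP BY c
        for (String user : users) {
            Long old = perUser.get(user);
            long updated = (old == null) ? 1L : old + 1L;
            perUser.put(user, updated);
            if (old != null && retract) {
                add(histogram, old, -1L);              // -U[old]
            }
            add(histogram, updated, 1L);               // +I[updated] or +U[updated]
        }
        return histogram;
    }

    private static void add(Map<Long, Long> histogram, long key, long delta) {
        long next = histogram.getOrDefault(key, 0L) + delta;
        if (next == 0L) {
            histogram.remove(key);
        } else {
            histogram.put(key, next);
        }
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("alice", "bob", "alice");
        System.out.println(run(users, true));
        System.out.println(run(users, false));
    }
}
```

With retraction the result is `{1=1, 2=1}` (one user with one row, one user with two rows). Without it the result is `{1=2, 2=1}`, which counts three users although only two exist.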
  • 20. Changelog Insights – Append-only
        CREATE TABLE Transaction (tid BIGINT, amount INT);
        CREATE TABLE Payment (tid BIGINT, method STRING);
        CREATE TABLE Result (tid BIGINT, …); -- accepts all changes
        INSERT INTO Result SELECT * FROM Transaction T JOIN Payment P ON T.tid = P.tid;
    Plan
        Sink(table=[Result], changelogMode=[NONE])
        +- Join(leftInputSpec=[NoUniqueKey], rightInputSpec=[NoUniqueKey], changelogMode=[I])
           :- Exchange(changelogMode=[I])
           :  +- TableSourceScan(table=[[Transaction]], changelogMode=[I])
           +- Exchange(changelogMode=[I])
              +- TableSourceScan(table=[[Payment]], changelogMode=[I])
  • 21. Changelog Insights – Updating
        CREATE TABLE Transaction (tid BIGINT, amount INT);
        CREATE TABLE Payment (tid BIGINT, method STRING);
        CREATE TABLE Result (tid BIGINT, …);
        INSERT INTO Result SELECT * FROM Transaction T LEFT JOIN Payment P ON T.tid = P.tid;
    Plan
        Sink(table=[Result], changelogMode=[NONE])
        +- Join(leftInputSpec=[NoUniqueKey], rightInputSpec=[NoUniqueKey], changelogMode=[I,UB,UA,D])
           :- Exchange(changelogMode=[I])
           :  +- TableSourceScan(table=[[Transaction]], changelogMode=[I])
           +- Exchange(changelogMode=[I])
              +- TableSourceScan(table=[[Payment]], changelogMode=[I])
  • 22. Changelog Insights – Updating with PK
        CREATE TABLE Transaction (tid BIGINT, amount INT);
        CREATE TABLE Payment (tid BIGINT, method STRING);
        CREATE TABLE Result (tid BIGINT, …, PRIMARY KEY(tid) NOT ENFORCED);
        INSERT INTO Result SELECT * FROM Transaction T LEFT JOIN Payment P ON T.tid = P.tid;
    Plan
        Sink(table=[Result], changelogMode=[NONE], upsertMaterialize=[true])
        +- Join(leftInputSpec=[NoUniqueKey], rightInputSpec=[NoUniqueKey], changelogMode=[I,UB,UA,D])
           :- Exchange(changelogMode=[I])
           :  +- TableSourceScan(table=[[Transaction]], changelogMode=[I])
           +- Exchange(changelogMode=[I])
              +- TableSourceScan(table=[[Payment]], changelogMode=[I])
  • 23. Changelog Insights – Updating with PK
        CREATE TABLE Transaction (tid BIGINT, …, PRIMARY KEY(tid) NOT ENFORCED);
        CREATE TABLE Payment (tid BIGINT, …, PRIMARY KEY(tid) NOT ENFORCED);
        CREATE TABLE Result (tid BIGINT, …, PRIMARY KEY(tid) NOT ENFORCED);
        INSERT INTO Result SELECT * FROM Transaction T LEFT JOIN Payment P ON T.tid = P.tid;
    Plan
        Sink(table=[Result], changelogMode=[NONE])
        +- Join(leftInputSpec=[UniqueKey], rightInputSpec=[UniqueKey], changelogMode=[I,UA,D])
           :- Exchange(changelogMode=[I])
           :  +- TableSourceScan(table=[[Transaction]], changelogMode=[I])
           +- Exchange(changelogMode=[I])
              +- TableSourceScan(table=[[Payment]], changelogMode=[I])
  • 24. Mode Transitions
    [Diagram: transitions between the modes Append-only, Retracting, and Updating. Edge labels: "through operation"; "if operator/sink requires it" (ChangelogNormalize); "if sink requires it" (UpsertMaterialize).]
  • 25. Mode Transitions – Characteristics
    Append-only
    ● Event-time column backed by watermarks
    ● Highly state efficient due to notion of completeness
    Retracting / Updating
    ● Usually no event-time column
    ● State usage needs to be kept in mind
    ● Pure materialized view maintenance
    Labels on the slide: aka "TABLE", aka "STREAM", aka ?
  • 26. Mode Transitions – Joins
        Left input   Right input  Join                Output
        Append-only  Append-only  regular join        Append-only
        Append-only  Append-only  regular outer join  Updating
        Append-only  Updating     regular join        Updating
        Append-only  Updating     temporal join       Append-only
  • 27. Mode Transitions – Temporal Join
        SELECT order_id, price, currency, conversion_rate, order_time
        FROM Orders
        LEFT JOIN CurrencyRates FOR SYSTEM_TIME AS OF Orders.order_time
        ON Orders.currency = CurrencyRates.currency;
        CREATE TABLE CurrencyRates (
          WATERMARK FOR update_time AS …,
          PRIMARY KEY(currency) NOT ENFORCED,…);
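The idea behind `FOR SYSTEM_TIME AS OF` can be sketched as a lookup of the version that was valid at the order's time. The snippet below is a plain-Java illustration under simplifying assumptions: it ignores watermarks and late data, keeps the whole history of one currency in memory, and `TemporalLookup` is a hypothetical name. The Euro values are the ones from the currency table shown later in the deck.

```java
import java.time.LocalTime;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a versioned lookup; not Flink's implementation.
final class TemporalLookup {

    // Version history of one currency: update_time -> conversion_rate.
    static TreeMap<LocalTime, Integer> euroHistory() {
        TreeMap<LocalTime, Integer> history = new TreeMap<>();
        history.put(LocalTime.of(9, 0), 114);
        history.put(LocalTime.of(11, 15), 119);
        return history;
    }

    // Returns the rate valid at orderTime, or null if no version exists yet
    // (a LEFT JOIN would then pad the rate column with NULL).
    static Integer rateAsOf(TreeMap<LocalTime, Integer> history, LocalTime orderTime) {
        Map.Entry<LocalTime, Integer> version = history.floorEntry(orderTime);
        return version == null ? null : version.getValue();
    }

    public static void main(String[] args) {
        TreeMap<LocalTime, Integer> euro = euroHistory();
        System.out.println(rateAsOf(euro, LocalTime.of(10, 30)));
        System.out.println(rateAsOf(euro, LocalTime.of(12, 0)));
        System.out.println(rateAsOf(euro, LocalTime.of(8, 0)));
    }
}
```

Because every order is matched against exactly one version that never changes afterwards, the join result only ever appends rows, which matches the append-only output listed for temporal joins.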
  • 28. Mode Transitions – Explicit Transition without PK
    Append-only
        op  update_time  currency  rate
        ==  ===========  ========  ====
        +I  09:00:00     Yen       102
        +I  09:00:00     Euro      114
        +I  09:00:00     USD       1
        +I  11:15:00     Euro      119
        +I  11:49:00     Pounds    108
    Updating
        op  update_time  currency  rate
        ==  ===========  ========  ====
        +I  09:00:00     Yen       102
        +I  09:00:00     Euro      114
        +I  09:00:00     USD       1
        +U  11:15:00     Euro      119
        +I  11:49:00     Pounds    108
  • 29. Mode Transitions – Explicit Transition without PK
    Append-only → Updating
        CREATE VIEW versioned_rates AS
        SELECT currency, rate, update_time
        FROM (
          SELECT *,
            ROW_NUMBER() OVER (PARTITION BY currency ORDER BY update_time DESC) AS rownum
          FROM currency_rates
        )
        WHERE rownum = 1;
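What the `ROW_NUMBER() ... WHERE rownum = 1` view does to an append-only input can be sketched in plain Java: keep the latest row per currency and emit a change whenever rank 1 changes. This is an illustration only. `LatestRatePerCurrency` and `Rate` are hypothetical names, the output is shown in retract form (-U followed by +U), and ties on `update_time` are resolved in favor of the later arrival, which is an assumption of this sketch.

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "latest row per key"; not Flink code.
final class LatestRatePerCurrency {

    static final class Rate {
        final LocalTime updateTime;
        final String currency;
        final int rate;

        Rate(String updateTime, String currency, int rate) {
            this.updateTime = LocalTime.parse(updateTime);
            this.currency = currency;
            this.rate = rate;
        }

        @Override
        public String toString() {
            return "[" + currency + ", " + rate + "]";
        }
    }

    // The append-only input rows from the table slide.
    static List<Rate> sample() {
        return Arrays.asList(
            new Rate("09:00:00", "Yen", 102),
            new Rate("09:00:00", "Euro", 114),
            new Rate("09:00:00", "USD", 1),
            new Rate("11:15:00", "Euro", 119),
            new Rate("11:49:00", "Pounds", 108));
    }

    // Turns the append-only rows into an updating changelog keyed by currency.
    static List<String> toChangelog(List<Rate> appendOnly) {
        Map<String, Rate> latest = new HashMap<>();
        List<String> changelog = new ArrayList<>();
        for (Rate row : appendOnly) {
            Rate previous = latest.get(row.currency);
            if (previous == null) {
                changelog.add("+I" + row);
            } else if (!row.updateTime.isBefore(previous.updateTime)) {
                changelog.add("-U" + previous);
                changelog.add("+U" + row);
            } else {
                continue; // older than the current version, rank 1 is unchanged
            }
            latest.put(row.currency, row);
        }
        return changelog;
    }

    public static void main(String[] args) {
        toChangelog(sample()).forEach(System.out::println);
    }
}
```

For the sample rows, only the second Euro row triggers an update; every other row is a plain insertion.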
  • 31. Summary
    TLDR
    ● Flink's SQL engine is a powerful changelog processor
    ● Flexible tool for integrating systems with different semantics
    There is more…
    ● CDC connector ecosystem → 2.6k GitHub stars
      https://flink-packages.org/packages/cdc-connectors
    ● Table Store → unified storage engine for dynamic tables
      https://flink.apache.org/news/2022/05/11/release-table-store-0.1.0.html
    ● SQL Gateway
      https://cwiki.apache.org/confluence/display/FLINK/FLIP-91%3A+Support+SQL+Gateway