Flink SQL:
The Challenges to Build a Streaming SQL Engine
Jinsong (Jingsong) Lee
Staff Engineer at Alibaba
Apache Flink PMC member & Paimon Founder
About Me
• Staff Engineer at Alibaba
(Lake Storage Team Lead)
• PMC member of Apache Flink
(Committer of Apache Iceberg, Beam)
• Founder of Apache Paimon
(A new lake format focused on streaming)
CONTENT
• What is Flink SQL?
• Challenges of Flink SQL
• State and Storage
• Summary and Future Work
What is Flink SQL?
• Data Movement
• Data Warehouse
Main Scenarios of Flink SQL
Data Movement
Data Integration
Connector + Calc + Lookup Join
Data Warehouse
Unbounded Aggregate, Join
PV, UV …
Event Driven
Risk Control, Monitoring Alarms
Window, Interval Join, UDF, CEP
Powerful Connector Ecosystem
TiDB
ApsaraDB MySQL
ClickHouse
Iceberg
Hudi
Paimon
Streaming & Batch
Calc & UDF & Lookup Join
Flink SQL & Flink CDC
Data Movement: more and more companies are building unified streaming-and-batch
data integration platforms on Flink SQL, running 100,000+ jobs.
How Does Flink SQL Work?
SQL
Table API
Logical Plan → Physical Plan → Transformations → JobGraph
Configurable optimizer phases
Catalog
Hive
Metastore
Code Generation
Optimizer
SubQuery Decorrelation
Filter/Project PushDown
Join Reorder
…
Code Optimizations: generated operators, JVM intrinsics, declarative expressions
State-of-the-art Operators: operate on binary data, cache-efficient sorter, compact binary hash map, hybrid hash join
Resource Optimizations: fully managed memory, IO Manager, off-heap memory
Flink Cluster
Submit Job
An Example
SELECT
t1.id, 1 + 2 + t1.value AS v
FROM t1, t2
WHERE
t1.id = t2.id AND
t2.id < 1000
Scan (t1) Scan (t2)
Join
Filter
Project
t1.id = t2.id
t2.id < 1000
t1.id,
1+2+t1.value
Logical Plan
SQL Query
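The optimizer applies classic rewrites to plans like this one. The following toy sketch (plain Python, not Flink's actual planner, with a hypothetical tuple-based plan representation) illustrates two of them for the example query: folding the constant expression `1 + 2` and pushing the single-table predicate `t2.id < 1000` below the join.

```python
# Toy illustration (not Flink code) of constant folding and filter pushdown.

def fold_constants(expr):
    """Recursively fold ('+', a, b) nodes whose operands are both literals."""
    if isinstance(expr, tuple) and expr[0] == "+":
        left, right = fold_constants(expr[1]), fold_constants(expr[2])
        if isinstance(left, int) and isinstance(right, int):
            return left + right
        return ("+", left, right)
    return expr

def push_down_filter(plan):
    """Move a single-table predicate from above the join onto its scan."""
    op, pred, child = plan                  # ("filter", pred, ("join", l, r))
    assert op == "filter" and child[0] == "join"
    _, left, right = child
    table = pred[1].split(".")[0]           # the one table the predicate touches
    if table == right[1]:
        right = ("filter", pred, right)
    else:
        left = ("filter", pred, left)
    return ("join", left, right)

# 1 + 2 + t1.value folds to 3 + t1.value
folded = fold_constants(("+", ("+", 1, 2), "t1.value"))

# Filter(t2.id < 1000) over Join(Scan t1, Scan t2) becomes
# Join(Scan t1, Filter(Scan t2)): t2 is filtered before the join.
plan = ("filter", ("<", "t2.id", 1000),
        ("join", ("scan", "t1"), ("scan", "t2")))
optimized = push_down_filter(plan)
```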
Changelog Mechanism
Source table logs (word): Hello, World, Hello

SELECT
word,
COUNT(*) as cnt
FROM logs
GROUP BY word

word_count (word, cnt) evolves: (Hello, 1) → (World, 1) → (Hello, 2)

SELECT
cnt,
COUNT(cnt) as freq
FROM word_count
GROUP BY cnt

Result (cnt, freq) evolves: (1, 1) → (1, 2) → (1, 1), (2, 1)
with changelog
① Changelog makes the streaming query result correct
② The query optimizer determines whether update_before is needed
③ Users are not aware of it
Hello, 1
insert
World, 1
insert
Hello, 1
update_before
Hello, 2
update_after
Hello
insert
World
insert
Hello
insert
Source Word Count Count Frequency
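The two-level word count above can be sketched in plain Python (not Flink code): the first aggregate emits +I for a new key and -U/+U for an update, and the downstream count-frequency aggregate applies those changelog records so its result stays correct.

```python
# Minimal sketch of the changelog mechanism between two aggregates.
from collections import defaultdict

def word_count(words):
    """Yield changelog records (kind, word, cnt) for a stream of words."""
    counts = {}
    for w in words:
        if w in counts:
            yield ("-U", w, counts[w])      # retract the old count
            counts[w] += 1
            yield ("+U", w, counts[w])      # emit the new count
        else:
            counts[w] = 1
            yield ("+I", w, 1)              # first occurrence: plain insert

def count_frequency(changelog):
    """Consume the changelog; return {cnt: how many words have that cnt}."""
    freq = defaultdict(int)
    for kind, _word, cnt in changelog:
        if kind == "-U":
            freq[cnt] -= 1                  # undo the retracted row
            if freq[cnt] == 0:
                del freq[cnt]
        else:                               # +I / +U
            freq[cnt] += 1
    return dict(freq)

log = list(word_count(["Hello", "World", "Hello"]))
result = count_frequency(log)               # {1: 1, 2: 1}
```

Without the -U record, the downstream aggregate would still count Hello under cnt = 1 after it moved to cnt = 2.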
Changelog Makes CDC Processing Transparent
INSERT INTO dynamo_table SELECT
o.order_id, o.total, c.country, CONCAT(msg, '_SUFFIX') AS msg
FROM Orders_CDC AS o
JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
ON o.customer_id = c.id;
Lookup Join
Orders
Customers
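The lookup join's semantics can be sketched as follows (a plain Python illustration with a hypothetical in-memory dimension table, not the Flink connector API): each incoming order probes the *current* contents of Customers at processing time, so no state is kept for the dimension side.

```python
# Sketch of FOR SYSTEM_TIME AS OF proc_time lookup-join semantics.

customers = {1: "US", 2: "DE"}              # dimension table: id -> country

def lookup_join(orders):
    """Enrich (order_id, customer_id, total) with the country seen right now."""
    out = []
    for order_id, customer_id, total in orders:
        country = customers.get(customer_id)    # point lookup at processing time
        out.append((order_id, total, country))
    return out

enriched = lookup_join([(100, 1, 9.5), (101, 3, 2.0)])
# customer 3 is unknown at lookup time -> None (an inner join would drop it)
```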
Challenges of Flink SQL
• Late Data: Unbounded Operators
• Retractions Amplification: Mini-Batch
• Event Ordering
• Nondeterminism
Late Data & Unbounded Operators
• No Watermark and Late Event
• Unlimited State: or a manually tuned State Time-To-Live
• Upsert Sink: outputs results early, relying on idempotence to achieve eventual consistency
SELECT SUM(num) FROM T GROUP BY color
(Source → Upsert Sink)
Real World:
GROUP BY color and day…
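The real-world variant makes the problem concrete: grouping by color *and* day mints fresh keys every day, so the keyed state grows without bound unless a state TTL is configured. A small illustrative sketch (plain Python, not Flink code):

```python
# Why GROUP BY color, day is an unbounded operator: every new day adds keys.
from collections import defaultdict

state = defaultdict(int)                    # keyed state: (color, day) -> sum

def on_record(color, day, num):
    """Update the running SUM for the key and emit the new result."""
    state[(color, day)] += num
    return (color, day, state[(color, day)])

for day in ("2024-01-01", "2024-01-02", "2024-01-03"):
    on_record("red", day, 1)
    on_record("blue", day, 2)

# Three days of two colors already mean six state entries, and the count
# keeps climbing day after day; a state TTL is the usual way to bound it.
```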
Retractions Amplification in Complex DAG
Scan (t1) Left Join Left Join
Scan (t2)
Aggregate
Scan (t3)
1 record
2 records: 1 -U, 1 +U
4 records: 2 -U, 2 +U
8 records: 4 -U, 4 +U
• Flink SQL Changelog Mechanism: +I -U +U -D
• Stateful operators produce -U and +U for each update
• Amplification in a complex DAG: 8X …
Mini-Batch to Reduce Amplification
• Use heap memory to hold a bundle
• In-memory aggregation before accessing state and serde operations
• Also eases the downstream load
• But Mini-Batch Join is still lacking
Mini-Batch aggregation:
table.exec.mini-batch.enabled = true
table.exec.mini-batch.allow-latency = "5000 ms"
table.exec.mini-batch.size = 1000
SELECT SUM(num) FROM T GROUP BY color
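What mini-batching buys can be shown with a small simulation (plain Python, not the Flink operator): instead of emitting -U/+U per input record, the aggregate pre-aggregates a buffered bundle in memory and emits at most one update per key per batch.

```python
# Sketch of SUM(num) GROUP BY color with and without mini-batching.
from collections import defaultdict

def aggregate(records, batch_size):
    """Return the changelog records emitted downstream."""
    sums, emitted, seen = defaultdict(int), [], set()
    for start in range(0, len(records), batch_size):
        bundle = defaultdict(int)
        for color, num in records[start:start + batch_size]:
            bundle[color] += num            # in-memory pre-aggregation
        for color, delta in bundle.items():
            if color in seen:
                emitted.append(("-U", color, sums[color]))
                sums[color] += delta
                emitted.append(("+U", color, sums[color]))
            else:
                seen.add(color)
                sums[color] = delta
                emitted.append(("+I", color, delta))
    return emitted

records = [("red", 1)] * 6
per_record = aggregate(records, batch_size=1)   # 11 changelog records
mini_batch = aggregate(records, batch_size=3)   # 3 changelog records
```

Both runs end at the same SUM of 6, but the mini-batched run sends far fewer retractions downstream, which is exactly how it damps the amplification from the previous slide.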
Event Ordering for CDC Sources
-- CDC source tables: s1 & s2
s1: id BIGINT, level BIGINT, PRIMARY KEY(id)
s2: id BIGINT, attr VARCHAR, PRIMARY KEY(id)
-- sink table: t1
t1: id BIGINT, level BIGINT, attr VARCHAR,
PRIMARY KEY(id)
-- join s1 and s2 and insert the result into t1
INSERT INTO t1
SELECT s1.*, s2.attr
FROM s1 JOIN s2 ON s1.level = s2.id
Data shuffles in distributed environments make the changelog out of order
Event Ordering: Solution
• Sink Upsert Materializer
• Rely on State (State TTL)
• Poor Performance
• Optimize:
• Just use RocksDB inside
checkpoint
• Sink Store supports
version fields
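The version-field idea from the last bullet can be sketched in a few lines (a plain Python illustration with a hypothetical `(id, version, value)` record shape, not a real sink implementation): the sink keeps, per primary key, only the row with the highest version, so a stale update that arrives late after the shuffle is simply ignored.

```python
# Sketch of a versioned upsert sink resolving out-of-order changelog rows.

def apply_upserts(changelog):
    """Apply (key, version, value) upserts; highest version wins per key."""
    table = {}
    for key, version, value in changelog:
        current = table.get(key)
        if current is None or version > current[0]:
            table[key] = (version, value)   # newer version wins
        # else: a stale row delivered late by the shuffle; drop it
    return {k: v for k, (_, v) in table.items()}

# The shuffle delivered version 2 before version 1.
out_of_order = [(7, 2, "level=5"), (7, 1, "level=4")]
naive = {k: v for k, _, v in out_of_order}      # last arrival wins: stale row
versioned = apply_upserts(out_of_order)         # highest version wins: correct
```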
Nondeterminism
Nondeterministic functions: CURRENT_TIMESTAMP, RANDOM
• What if the source is a CDC source?
• The retraction outputs different records!
(CDC SRC → Group by → Sum: the Sum is incorrect!)
How to Solve Nondeterminism
Streaming Deduplicate: deduplicate by state before the nondeterministic computation (CURRENT_TIMESTAMP, RANDOM), so the pipeline CDC SRC → Dedup → Group by → Sum stays consistent
Or deduplicate via the streaming lake, Paimon
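The failure mode and the fix can be simulated in a few lines (plain Python; the `nondeterministic()` function is a hypothetical stand-in for RANDOM, not a Flink API): a CDC delete must retract exactly what was emitted before, but re-evaluating the nondeterministic expression yields a different value and the SUM drifts; storing the emitted value in state and replaying it for the retraction keeps the result correct.

```python
# Sketch of nondeterminism breaking a SUM over CDC, and dedup-by-state fixing it.

values = iter([10, 99])                     # two calls, two different results
def nondeterministic():
    return next(values)                     # stand-in for RANDOM

def run(cdc_events, dedup_state=None):
    """SUM over f(x) for a +I followed by -D on the same key."""
    total = 0
    for kind, key in cdc_events:
        if dedup_state is None:
            v = nondeterministic()          # recomputed on every event
        elif kind == "+I":
            v = dedup_state[key] = nondeterministic()
        else:
            v = dedup_state[key]            # replay the value we emitted before
        total += v if kind == "+I" else -v
    return total

events = [("+I", "k"), ("-D", "k")]
wrong = run(events)                          # retracts 99 for an emitted 10
values = iter([10, 99])                      # reset the fake RANDOM
right = run(events, dedup_state={})          # retracts the stored 10: sum is 0
```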
State & Storage
Local State
[Diagram: Task Managers (Compute) keep state on local disk, managed by a State Manager per Flink task; checkpoints periodically dump state files to DFS (main storage), incrementally and asynchronously.]
• Main in Local
• High Performance
• Small State 👍
• Big State ❎ needs State TTL
Disaggregated State: Flink 2.0
[Diagram: Task Managers keep state mainly in memory with an optional local-disk cache shared across tasks; state files are uploaded asynchronously to DFS (main storage), which holds the checkpoints CP1, CP2, CP3.]
• Main in DFS
• Big State 👍
• How to cut data?
• How to Rescale?
• How to Share?
Lake State (Apache Paimon)
[Diagram: Logs and RDBMS binlogs are ingested via Flink CDC into Paimon (formerly Flink Table Store) tables layered as ODS / DWD / DWS / ADS; Flink SQL reads and writes every layer in streaming & batch, and queries serve downstream data serving systems.]
• Latency: Minute Level
• Merge Engine: Deduplicate, Partial-Update, Aggregation, First Row
• No Data TTL, Performance Improvement, 10X
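The listed merge engines can be sketched as follows (plain Python, not Paimon internals), applied to rows sharing a primary key at read/compact time: "deduplicate" keeps the latest row wholesale, while "partial-update" overlays each row's non-NULL fields onto the previous result.

```python
# Sketch of Paimon-style merge-engine semantics for rows with one primary key.

def merge(rows, engine):
    """rows arrive oldest-first; each row is a dict of column -> value."""
    if engine == "deduplicate":
        return rows[-1]                     # latest row wins wholesale
    if engine == "partial-update":
        result = {}
        for row in rows:
            for col, val in row.items():
                if val is not None:         # NULL fields keep the old value
                    result[col] = val
        return result
    raise ValueError(engine)

rows = [{"id": 1, "name": "a", "addr": None},
        {"id": 1, "name": None, "addr": "HZ"}]
dedup = merge(rows, "deduplicate")          # the second row, as-is
partial = merge(rows, "partial-update")     # non-NULL fields of both combined
```

Partial-update is what lets two upstream streams each fill in their own columns of one wide table without a streaming join.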
Summary & Future Work
• Flink SQL: Data movement, Data Warehouse, Event Driven.
• The core concept of Flink SQL is CHANGELOG.
• The use case of Data Movement.
• 4 Challenges of Flink SQL
• Late Data: Unbounded Operators
• Retractions Amplification: Mini-Batch
• Event Ordering
• Nondeterminism
• State of Flink SQL: improvements and alternatives
Summary & Future Work
Thank You!
Questions?
