Flink 2.0: Navigating the Future of
Unified Stream and Batch Processing
Martijn Visser
Senior Product Manager and
Apache Flink PMC member
Real-time services rely on stream processing
Real-time data (a sale, a shipment, a trade) flows through real-time stream processing to power rich front-end customer experiences and real-time backend operations.
Developers choose Flink because of its performance
and rich feature set
Flink is a top 5 Apache project and boasts a robust developer community.

Scalability and Performance: Flink is capable of supporting stream processing workloads at tremendous scale.
Fault Tolerance: Flink's fault tolerance mechanisms ensure it can handle failures effectively and provide high availability.
Language Flexibility: Flink supports Java, Python, & SQL with 150+ built-in functions, enabling devs to work in their language of choice.
Unified Processing: Flink supports stream processing, batch processing, and ad-hoc analytics through one technology.
The Future of Unified Stream and Batch Processing
Four Focus Areas
Mixed Unification
Use the unified API and mix and switch automagically between Batch and Stream execution modes, for example when needing to reprocess or backfill data.
Unified SQL Platform
Add support for other common SQL elements like DELETE, UPDATE, stored procedures, time travel, and unstructured data types.
Streaming Warehouses
Integrate streaming and batch processing with real-time analytics and up-to-date storage, blending traditional data warehouse benefits with instant insights.
Engine Evolution
Cloud-native, disaggregated state backends, new APIs, SQL Gateway, JDBC Driver, and much more.
Mixed Unification
• Flink supports both Batch Execution and Streaming Execution modes
• What if you want to do backfill or reprocessing?
  MySQL CDC → phase 1 reads from a bounded snapshot, phase 2 from the unbounded binlog
  S3 + Kafka (HybridSource) → read historical data from your lake before switching to real-time
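The S3 + Kafka pattern above can be sketched with Flink's HybridSource, which replays a bounded source before switching to an unbounded one. This is a sketch, not code from the talk: the bucket path, topic, and bootstrap servers are placeholders, and the surrounding job setup is omitted.

```java
// Sketch: replay historical files from S3, then switch to live Kafka.
// Paths, topic, and servers are placeholder values.
FileSource<String> historical =
    FileSource.forRecordStreamFormat(
            new TextLineInputFormat(),
            new Path("s3://my-bucket/history/"))   // placeholder path
        .build();

KafkaSource<String> live =
    KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")        // placeholder
        .setTopics("orders")                       // placeholder
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

// The hybrid source reads `historical` to completion, then `live`.
HybridSource<String> source =
    HybridSource.builder(historical).addSource(live).build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "backfill-then-live");
```

Note that today both phases still run in one execution mode; FLIP-327 and FLIP-309 below are about letting the runtime adapt while processing the backlog phase.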
A couple of proposals:
• FLIP-327: Support switching from batch to stream mode to
improve throughput when processing backlog data
• FLIP-309: Support using larger checkpointing interval when source is processing backlog
• FLIP-326: Enhance Watermark to Support Processing-Time
Temporal Join
Unified SQL Platform: New DML Syntax
DELETE FROM user WHERE id = -1;
DELETE FROM user WHERE id > (SELECT count(*) FROM employee);
UPDATE user SET name = 'u1' WHERE id > 10;
UPDATE user SET name = 'u1' WHERE id > (SELECT count(*) FROM employee);
TRUNCATE TABLE user;
CALL `my_cat`.`my_db`.add_user('Martijn', 'Product Manager');
Unified SQL Platform: Time Travel
-- As of a point in the past
SELECT * FROM t FOR SYSTEM_TIME AS OF TIMESTAMP '2023-03-19 00:00:00';

-- As of now
SELECT * FROM t;
SELECT * FROM t FOR SYSTEM_TIME AS OF CURRENT_TIMESTAMP;
Streaming Warehouses
• Unified changelog & table representation, which originated as FLIP-188: Introduce Built-in Dynamic Table Storage and is now known as Apache Paimon (Incubating)
• Improve OLAP support, e.g. faster short-lived jobs to serve OLAP queries with low latency and concurrent execution
• Cost-based optimization (CBO) with statistics
• Make full use of the layout and indexes of the streaming lakehouse to reduce data reading and processing for streaming queries
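As an illustration of the streaming-warehouse idea, a Paimon table can be declared and continuously maintained from Flink SQL. This is a sketch with placeholder names; the warehouse path, table, and source are not from the talk.

```sql
-- Sketch: register a Paimon catalog and create a table that acts as
-- both a changelog and a queryable table. Names are placeholders.
CREATE CATALOG paimon_cat WITH (
  'type' = 'paimon',
  'warehouse' = 's3://my-bucket/warehouse'
);

USE CATALOG paimon_cat;

CREATE TABLE orders_wh (
  id BIGINT,
  amount DECIMAL(10, 2),
  ds STRING,
  PRIMARY KEY (id, ds) NOT ENFORCED
);

-- A streaming INSERT keeps the table continuously up to date,
-- while batch or OLAP queries can read the same table.
INSERT INTO orders_wh SELECT id, amount, ds FROM source_orders;
```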
Engine Evolution: Cloud-native, Disaggregated State
• FLIP-423: Disaggregated State Storage and Management
• FLIP-424: Asynchronous State APIs
• FLIP-425: Asynchronous Execution Model
• FLIP-426: Grouping Remote State Access
• FLIP-427: ForSt - Disaggregated State Store
• FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State
Engine Evolution: DataStream API V2
FLIP-408: Introduce DataStream API V2
1. The DataStream API exposes internal concepts and implementation details to users.
2. It is a complex API whose primitives correspond to concepts at many different levels.
3. It started for streaming; batch support was added later.
Engine Evolution: Dynamic Tables
• Flink SQL and the Table API have always had the concept of Dynamic Tables
• FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines proposes a new entity:
CREATE DYNAMIC TABLE dwd_orders (
  PRIMARY KEY (ds, id) NOT ENFORCED)
DISTRIBUTED BY (ds)
FRESHNESS = INTERVAL '3' MINUTE
AS SELECT *
   FROM orders AS o
   LEFT JOIN order_pay AS pay
     ON o.id = pay.order_id AND o.ds = pay.ds;
Engine Evolution: Polymorphic Table Functions
CREATE OR REPLACE PACKAGE BODY dynamic_cols_pkg IS
  -- column_list_t and the return type string_tab_t are assumed to be
  -- declared in the package spec
  FUNCTION get_dynamic_cols(column_list column_list_t)
    RETURN string_tab_t PIPELINED IS
  BEGIN
    -- Example logic to select different columns based on input
    -- In practice, use dynamic SQL to build and execute the query
    IF column_list.EXISTS(1) THEN
      IF column_list(1) = 'name' THEN
        PIPE ROW ('John Doe');
      ELSIF column_list(1) = 'age' THEN
        PIPE ROW ('30');
      END IF;
    END IF;
    RETURN;
  END get_dynamic_cols;
END dynamic_cols_pkg;

SELECT * FROM
  TABLE(dynamic_cols_pkg.get_dynamic_cols(column_list_t('name')));
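Flink SQL already ships polymorphic table functions in the form of windowing TVFs, which take a table and a column descriptor as arguments and return a new table. For example (table and column names are placeholders):

```sql
-- A window TVF is a polymorphic table function: it takes a table,
-- a column descriptor, and an interval, and returns the input table
-- extended with window_start/window_end columns.
SELECT window_start, window_end, SUM(amount) AS total_amount
FROM TABLE(
  TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;
```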
Flink 2.0
Flink 2.0
https://cwiki.apache.org/confluence/display/FLINK/2.0+Release
Flink 2.0 is primarily a clean-up
Removed in 2.0:
• DataSet API
• Deprecated methods/fields/classes in DataStream API and Table API
• Scala APIs
• Deprecated Source / Sink / TableSource / TableSink interfaces
• Legacy SQL Function and Operator stack
• Old configuration layer
• Java 8 and 11 support

Refactored in 2.0:
• Refactored REST API
• Refactored Metrics System
• No default-to-Kryo serialization
• Default to Java 17 (or 21)

New in 2.0 (or sooner):
• DataStream API V2
• Dynamic Tables
• Disaggregated State Backend/Management APIs
Flink 2.0 Expected Timeline
Flink 1.19: released March 2024
Flink 1.20: four/five months after Flink 1.19 (Jul/Aug 2024)
Flink 2.0: four/five months after Flink 1.20 (Dec 2024/Jan 2025)
Thank You