Flink 2.0: Navigating the Future of
Unified Stream and Batch Processing
Martijn Visser
Senior Product Manager and
Apache Flink PMC member
Real-time services rely on stream processing
Real-time data (a sale, a shipment, a trade) flows through real-time stream processing to power rich front-end customer experiences and real-time backend operations.
Developers choose Flink because of its performance
and rich feature set
Flink is a top 5 Apache project and boasts a robust developer community.

Scalability and Performance: Flink is capable of supporting stream processing workloads at tremendous scale.
Fault Tolerance: Flink's fault tolerance mechanisms ensure it can handle failures effectively and provide high availability.
Language Flexibility: Flink supports Java, Python, & SQL with 150+ built-in functions, enabling devs to work in their language of choice.
Unified Processing: Flink supports stream processing, batch processing, and ad-hoc analytics through one technology.
The Future of Unified Stream and Batch Processing
Four Focus Areas
Mixed Unification
Use the unified API and mix and switch automagically between Batch and Stream execution modes, for example when needing to reprocess or backfill data.
Unified SQL Platform
Add support for other common SQL elements like DELETE, UPDATE, stored procedures, time travel, and unstructured data types.
Streaming Warehouses
Integrate streaming and batch processing with real-time analytics and up-to-date storage, blending traditional data warehouse benefits with instant insights.
Engine Evolution
Cloud-native, disaggregated state backends, new APIs, SQL Gateway, JDBC Driver, and much more.
Mixed Unification
• Flink supports both Batch Execution and Streaming Execution modes
• What if you want to do backfill or reprocessing?
  MySQL CDC → phase 1 reads from a bounded snapshot, phase 2 from the unbounded binlog
  S3 + Kafka (HybridSource) → read historical data from your lake before switching to real-time
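The S3 + Kafka pattern above can be sketched with Flink's HybridSource, which replays a bounded source before switching to an unbounded one. This is a sketch, not code from the talk: the bucket path, topic, and bootstrap servers are placeholders, and the surrounding job setup is omitted.

```java
// Sketch: replay historical files from S3, then switch to live Kafka.
// Paths, topic, and servers are placeholder values.
FileSource<String> historical =
    FileSource.forRecordStreamFormat(
            new TextLineInputFormat(),
            new Path("s3://my-bucket/history/"))   // placeholder path
        .build();

KafkaSource<String> live =
    KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")        // placeholder
        .setTopics("orders")                       // placeholder
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

// The hybrid source reads `historical` to completion, then `live`.
HybridSource<String> source =
    HybridSource.builder(historical).addSource(live).build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "backfill-then-live");
```

Note that today both phases still run in one execution mode; FLIP-327 and FLIP-309 below are about letting the runtime adapt while processing the backlog phase.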
A couple of proposals:
• FLIP-327: Support switching from batch to stream mode to
improve throughput when processing backlog data
• FLIP-309: Support using larger checkpointing interval when source is processing backlog
• FLIP-326: Enhance Watermark to Support Processing-Time
Temporal Join
Unified SQL Platform: New DML Syntax
DELETE FROM user WHERE id = -1;
DELETE FROM user WHERE id > (SELECT count(*) FROM employee);
UPDATE user SET name = 'u1' WHERE id > 10;
UPDATE user SET name = 'u1' WHERE id > (SELECT count(*) FROM employee);
TRUNCATE TABLE user;
CALL `my_cat`.`my_db`.add_user('Martijn', 'Product Manager');
Unified SQL Platform: Time Travel
-- As of a point in the past
SELECT * FROM t FOR SYSTEM_TIME AS OF TIMESTAMP '2023-03-19 00:00:00';

-- As of now
SELECT * FROM t;
SELECT * FROM t FOR SYSTEM_TIME AS OF CURRENT_TIMESTAMP;
Streaming Warehouses
• Unified changelog & table representation, which originated as FLIP-188: Introduce Built-in Dynamic Table Storage and is now known as Apache Paimon (Incubating)
• Improve OLAP support, e.g. faster short-lived jobs to serve OLAP queries with low latency and concurrent execution
• Cost-based optimization (CBO) with statistics
• Make full use of the layout and indexes of the streaming lakehouse to reduce data reading and processing for streaming queries
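As an illustration of the streaming-warehouse idea, a Paimon table can be declared and continuously maintained from Flink SQL. This is a sketch with placeholder names; the warehouse path, table, and source are not from the talk.

```sql
-- Sketch: register a Paimon catalog and create a table that acts as
-- both a changelog and a queryable table. Names are placeholders.
CREATE CATALOG paimon_cat WITH (
  'type' = 'paimon',
  'warehouse' = 's3://my-bucket/warehouse'
);

USE CATALOG paimon_cat;

CREATE TABLE orders_wh (
  id BIGINT,
  amount DECIMAL(10, 2),
  ds STRING,
  PRIMARY KEY (id, ds) NOT ENFORCED
);

-- A streaming INSERT keeps the table continuously up to date,
-- while batch or OLAP queries can read the same table.
INSERT INTO orders_wh SELECT id, amount, ds FROM source_orders;
```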
Engine Evolution: Cloud-native, Disaggregated State
• FLIP-423: Disaggregated State Storage and Management
• FLIP-424: Asynchronous State APIs
• FLIP-425: Asynchronous Execution Model
• FLIP-426: Grouping Remote State Access
• FLIP-427: ForSt - Disaggregated State Store
• FLIP-428: Fault Tolerance/Rescale Integration for Disaggregated State
Engine Evolution: DataStream API V2
FLIP-408: Introduce DataStream API V2
1. The DataStream API exposes internal concepts and implementation details to users.
2. It is a complex API whose primitives correspond to concepts at many different levels.
3. It started for streaming; batch support was added later.
Engine Evolution: Dynamic Tables
• Flink SQL and the Table API have always had the concept of Dynamic Tables
• FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines proposes a new entity:
CREATE DYNAMIC TABLE dwd_orders (
  PRIMARY KEY (ds, id) NOT ENFORCED)
DISTRIBUTED BY (ds)
FRESHNESS = INTERVAL '3' MINUTE
AS SELECT *
   FROM orders AS o
   LEFT JOIN order_pay AS pay
     ON o.id = pay.order_id AND o.ds = pay.ds;
Engine Evolution: Polymorphic Table Functions
CREATE OR REPLACE PACKAGE BODY dynamic_cols_pkg IS
  -- column_list_t and the return type string_tab_t are assumed to be
  -- declared in the package spec
  FUNCTION get_dynamic_cols(column_list column_list_t)
    RETURN string_tab_t PIPELINED IS
  BEGIN
    -- Example logic to select different columns based on input
    -- In practice, use dynamic SQL to build and execute the query
    IF column_list.EXISTS(1) THEN
      IF column_list(1) = 'name' THEN
        PIPE ROW ('John Doe');
      ELSIF column_list(1) = 'age' THEN
        PIPE ROW ('30');
      END IF;
    END IF;
    RETURN;
  END get_dynamic_cols;
END dynamic_cols_pkg;

SELECT * FROM
  TABLE(dynamic_cols_pkg.get_dynamic_cols(column_list_t('name')));
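Flink SQL already ships polymorphic table functions in the form of windowing TVFs, which take a table and a column descriptor as arguments and return a new table. For example (table and column names are placeholders):

```sql
-- A window TVF is a polymorphic table function: it takes a table,
-- a column descriptor, and an interval, and returns the input table
-- extended with window_start/window_end columns.
SELECT window_start, window_end, SUM(amount) AS total_amount
FROM TABLE(
  TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;
```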
Flink 2.0
Flink 2.0
https://cwiki.apache.org/confluence/display/FLINK/2.0+Release
Flink 2.0 is primarily a clean-up
Removed in 2.0:
• DataSet API
• Deprecated methods/fields/classes in DataStream API and Table API
• Scala APIs
• Deprecated Source / Sink / TableSource / TableSink interfaces
• Legacy SQL Function and Operator stack
• Old configuration layer
• Java 8 and 11 support

Refactored in 2.0:
• Refactored REST API
• Refactored Metrics System
• No default-to-Kryo serialization
• Default to Java 17 (or 21)

New in 2.0 (or sooner):
• DataStream API V2
• Dynamic Tables
• Disaggregated State Backend/Management APIs
Flink 2.0 Expected Timeline
Flink 1.19: released March 2024
Flink 1.20: four/five months after Flink 1.19 (Jul/Aug 2024)
Flink 2.0: four/five months after Flink 1.20 (Dec 2024/Jan 2025)
Thank You