Flink 1.9.0 added the ability to support multiple SQL planners under the same API. With this, we successfully merged many features from Alibaba's internal Flink version, called Blink. In this talk, I will introduce the architecture of the Blink planner and share the functionality and performance enhancements we added.
8. Major Blink Planner Features
• Common: New Type System, Binary Format
• Streaming: Aggregation Skew Handling, Bundle Processing, Dimension Join, Top N, Streaming De-duplication
• Batch: Hash-based Algorithms, Sort-based Algorithms, Full TPC-H Support
10. New Type System
• [FLIP-37] Rework of the Table API Type System
• Blink planner uses the new type system instead of TypeInformation
• Some new features, but there is still a lot to do
• Support Decimal(p, s)
• Support nullability
• Support TIMESTAMP WITH LOCAL TIME ZONE
12. Binary Formats (BinaryRow)
• Deeply integrated with memory segments
• No need to deserialize / Compact layout / Random access
• Also have BinaryString, BinaryArray, BinaryMap
[Figure: BinaryRow layout inside a memory segment: header (row type), null info, fixed-length part, variable-length part. The fixed-length part holds inline values such as 2019 and pointers into the variable-length part, which holds data such as 5 "Flink" and 7 "Forward".]
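The layout idea on this slide can be illustrated with a small sketch. It is a simplified model and not Flink's actual byte layout: the 8-byte null bitmap, the 8-byte slot per field, and the helper names `encode_row`, `read_int`, `read_str`, and `is_null` are all assumptions made for illustration. It shows why fields can be read by position without deserializing the whole row.

```python
import struct

def encode_row(values):
    """Encode ints, strs and None as:
    null bitmap (8 bytes) | one 8-byte slot per field | variable-length bytes.
    Hypothetical layout; supports at most 64 fields because of the bitmap."""
    n = len(values)
    null_bits = 0
    fixed = bytearray(8 * n)
    var = bytearray()
    var_base = 8 + 8 * n  # absolute offset where the variable-length part starts
    for i, v in enumerate(values):
        if v is None:
            null_bits |= 1 << i
        elif isinstance(v, int):
            # Fixed-length values are stored inline in their slot.
            struct.pack_into("<q", fixed, 8 * i, v)
        else:
            # Variable-length values store (offset, length) in the slot.
            data = v.encode("utf-8")
            struct.pack_into("<II", fixed, 8 * i, var_base + len(var), len(data))
            var += data
    return struct.pack("<Q", null_bits) + bytes(fixed) + bytes(var)

def is_null(buf, i):
    return bool((struct.unpack_from("<Q", buf, 0)[0] >> i) & 1)

def read_int(buf, i):
    # Random access: jump straight to slot i.
    return struct.unpack_from("<q", buf, 8 + 8 * i)[0]

def read_str(buf, i):
    offset, length = struct.unpack_from("<II", buf, 8 + 8 * i)
    return buf[offset:offset + length].decode("utf-8")

row = encode_row([2019, "Flink", "Forward", None])
```

Reading field 2 with `read_str(row, 2)` touches only its slot and the bytes it points to; the other fields are never decoded.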
14. Distinct Aggregation Skew Handling
• Local combine doesn’t work for distinct aggregation:
SELECT COUNT(DISTINCT id) FROM T GROUP BY color
• Optimized by rewriting the query:
SELECT color, SUM(cnt)
FROM (
SELECT color, COUNT(DISTINCT id) as cnt
FROM T
GROUP BY color, MOD(HASH_CODE(id), 1024)
)
GROUP BY color
table.optimizer.distinct-agg.split.enabled = true
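To see why the rewrite is safe, here is a minimal sketch that simulates both queries in plain Python. The function names and sample data are hypothetical, and Python's built-in `hash` stands in for HASH_CODE. Because a given id always falls into the same bucket, no id is counted twice across buckets, so the per-bucket distinct counts can simply be summed.

```python
from collections import defaultdict

NUM_BUCKETS = 1024  # mirrors MOD(HASH_CODE(id), 1024)

def count_distinct_direct(rows):
    """SELECT color, COUNT(DISTINCT id) FROM T GROUP BY color"""
    seen = defaultdict(set)
    for color, id_ in rows:
        seen[color].add(id_)
    return {color: len(ids) for color, ids in seen.items()}

def count_distinct_split(rows, num_buckets=NUM_BUCKETS):
    # Inner query: GROUP BY color, MOD(HASH_CODE(id), 1024)
    partial = defaultdict(set)
    for color, id_ in rows:
        partial[(color, hash(id_) % num_buckets)].add(id_)
    # Outer query: SUM(cnt) ... GROUP BY color
    total = defaultdict(int)
    for (color, _bucket), ids in partial.items():
        total[color] += len(ids)
    return dict(total)

# A skewed input: "red" is a hot key with far more rows than "blue".
rows = [("red", i % 5000) for i in range(20000)] + [("blue", i) for i in range(10)]
direct = count_distinct_direct(rows)
split = count_distinct_split(rows)
```

In the split version the work for the hot key "red" is spread over up to 1024 groups, which is what relieves the skew.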
16. Bundle Processing
Normal aggregation:
SELECT SUM(num) FROM T GROUP BY color
• Each record costs:
• One state read and one state write
• One deserialization and one serialization
• One output
17. Bundle Processing
Bundled aggregation:
table.exec.mini-batch.enabled = true
table.exec.mini-batch.allow-latency = "5000 ms"
table.exec.mini-batch.size = 1000
SELECT SUM(num) FROM T GROUP BY color
• Use heap memory to hold the bundle
• In-memory aggregation before accessing state and doing serde operations
• Also eases the load on downstream operators
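The saving from bundling can be made concrete with a small simulation. This is a sketch under stated assumptions, not the planner's operator: `CountingState` is a hypothetical stand-in for keyed state that counts accesses, and only the size trigger is modelled (the latency trigger is omitted).

```python
from collections import defaultdict

class CountingState:
    """Hypothetical stand-in for keyed state; counts reads and writes."""
    def __init__(self):
        self.data = {}
        self.reads = 0
        self.writes = 0

    def get(self, key):
        self.reads += 1
        return self.data.get(key, 0)

    def put(self, key, value):
        self.writes += 1
        self.data[key] = value

def aggregate_per_record(records, state):
    """One state read, one state write and one output per record."""
    outputs = 0
    for color, num in records:
        state.put(color, state.get(color) + num)
        outputs += 1
    return outputs

def aggregate_bundled(records, state, bundle_size):
    """Pre-aggregate on the heap, touch state once per key per bundle."""
    outputs = 0
    buffered = 0
    bundle = defaultdict(int)

    def flush():
        nonlocal outputs, buffered
        for color, partial in bundle.items():
            state.put(color, state.get(color) + partial)
            outputs += 1
        bundle.clear()
        buffered = 0

    for color, num in records:
        bundle[color] += num
        buffered += 1
        if buffered >= bundle_size:
            flush()
    flush()  # flush whatever is left at the end of the input
    return outputs

records = [("red", 1), ("blue", 2)] * 500  # 1000 records, 2 keys
plain, bundled = CountingState(), CountingState()
plain_outputs = aggregate_per_record(records, plain)
bundled_outputs = aggregate_bundled(records, bundled, bundle_size=1000)
```

Both variants end with the same sums, but the bundled one performs 2 state reads and emits 2 outputs instead of 1000 each, which is also why downstream operators see less load.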
18. Dimension Join
• Dimension and fact tables are popular concepts in data warehousing as well as in stream processing
• Frequently asked scenarios
• Reading facts from a message queue while dimension data is stored in a DB or key-value store
• Enrich the facts with the latest dimension data
• The dimension table itself might also be changing
• Different from a regular streaming join
• Changes to the dimension table don’t trigger the join
19. Processing Time Dimension Join
• Model the dimension table as a time-varying relation[1] (TVR), a relation that changes over time
• Temporal table introduces a new FOR SYSTEM_TIME keyword to access any point in time of the table
• No need to store the whole dimension table in state if it’s a LookupableTableSource
SELECT o.*, p.*
FROM Orders AS o
JOIN Products FOR SYSTEM_TIME AS OF PROCTIME() AS p
ON o.productId = p.productId
[1] One SQL to Rule Them All
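The semantics of the query above can be sketched as a per-record point lookup. This is an illustration only: the dict `products` stands in for an external store behind a lookupable source, and the `orderId` and `price` fields are hypothetical.

```python
def lookup_join(orders, products):
    """Enrich each order with the product row visible when the order is processed."""
    for order in orders:
        # Point lookup against the external table; nothing is kept in state.
        product = products.get(order["productId"])
        if product is not None:  # inner join: orders without a match are dropped
            yield {**order, **product}

products = {1: {"productId": 1, "price": 10}}
results = []

results += list(lookup_join([{"orderId": "o1", "productId": 1}], products))

# The dimension table changes. Nothing is emitted or corrected for "o1".
products[1] = {"productId": 1, "price": 12}

results += list(lookup_join([{"orderId": "o2", "productId": 1}], products))
```

Only arriving orders drive the join; the order "o1" keeps the price it saw at processing time, while "o2" sees the updated one.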
20. Top N
• It’s impractical to do a global streaming sort
• But it becomes possible if the user only cares about the top n elements
• E.g. calculate the top 3 sellers for each category
SELECT *
FROM (
SELECT *, -- sellerId and other columns are available from this
ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) AS rowNum
FROM shop_sales)
WHERE rowNum <= 3
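The reason Top N is tractable on a stream is that only n rows per partition need to be retained. A minimal sketch, assuming an append-only input without retractions; the class `TopN` and its methods are hypothetical names, not the planner's Rank operator:

```python
import bisect
from collections import defaultdict

class TopN:
    def __init__(self, n):
        self.n = n
        # category -> list of (-sales, seller), kept in ascending order,
        # so the best seller is always at index 0.
        self.ranked = defaultdict(list)

    def add(self, category, seller, sales):
        """Insert one record; return True if the category's top n changed."""
        rows = self.ranked[category]
        entry = (-sales, seller)
        pos = bisect.bisect_right(rows, entry)
        if pos >= self.n:
            return False  # ranks below n, so it can be discarded immediately
        rows.insert(pos, entry)
        del rows[self.n:]  # evict whatever fell out of the top n
        return True

    def top(self, category):
        return [(seller, -neg_sales) for neg_sales, seller in self.ranked[category]]

top3 = TopN(3)
changes = [
    top3.add("books", "a", 10),
    top3.add("books", "b", 30),
    top3.add("books", "c", 20),
    top3.add("books", "d", 5),
    top3.add("books", "e", 25),
]
```

A record that does not reach the top n produces no output at all, so state size is bounded by n rows per category regardless of stream length.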
21. Top N
[Figure: Original Plan vs. Optimized Plan. The original plan contains OverAggregate and Calc operators; the optimized plan uses a single Rank operator in their place.]
• Some other optimization factors:
• Whether the rank operator has to deal with retraction
• Whether the partition key is part of the primary key
• Whether the ordered field is monotonic
22. Streaming De-duplication
• A primary key is needed, and there are basically 2 scenarios:
• Upstream emits repeated data due to recovery, so only the first row is meaningful
• Upstream keeps updating the output for a key, so only the latest row is meaningful
• Similar to Top 1
Keep first row:
SELECT primary_key, a, b, c
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY primary_key
ORDER BY proctime ASC) AS rowNum
FROM T)
WHERE rowNum = 1
Keep last row:
SELECT primary_key, a, b, c
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY primary_key
ORDER BY proctime DESC) AS rowNum
FROM T)
WHERE rowNum = 1
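The two de-duplication modes can be sketched in a few lines. This is an illustration with hypothetical names and data; rows are assumed to arrive in processing-time order, and the keep-last variant materializes the final result rather than emitting an update stream.

```python
def keep_first_row(rows, key):
    """ORDER BY proctime ASC, rowNum = 1: emit a row only the first time its key is seen."""
    seen = set()
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            yield row

def keep_last_row(rows, key):
    """ORDER BY proctime DESC, rowNum = 1: later rows overwrite earlier ones per key."""
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())

rows = [
    {"pk": 1, "a": "x"},
    {"pk": 2, "a": "y"},
    {"pk": 1, "a": "z"},  # repeated key
]
first = list(keep_first_row(rows, "pk"))
last = keep_last_row(rows, "pk")
```

Keep-first only needs to remember which keys it has seen, whereas keep-last must hold the current row per key because it may still be replaced.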
24. Sort-based Algorithms
• Finiteness makes sorting practical and efficient
• Push-based multi-threaded sorter
• Mainly borrowed from Flink’s original UnilateralSortMerger
• Changed from pull-based to push-based
• Adapted to binary formats / Hotspot code generated
• Sort aggregation
• Sort merge join
• Supports inner/left/right/full/semi/anti joins
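As a reminder of the basic technique, here is a minimal in-memory sort merge inner join. It is a textbook sketch, not the planner's operator: it ignores spilling, the push-based sorter, binary formats and the other join types, and the function name is hypothetical.

```python
def sort_merge_inner_join(left, right, key):
    """Join two finite inputs on key(row) by sorting both and merging."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # Pair every left row of this key with the whole right run.
            while i < len(left) and key(left[i]) == lk:
                out.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return out

left = [(2, "b"), (1, "a"), (2, "c")]
right = [(2, "x"), (3, "y"), (2, "z")]
joined = sort_merge_inner_join(left, right, key=lambda row: row[0])
```

After sorting, each input is scanned once, which is what makes the approach attractive for bounded inputs that do not fit a hash table.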
25. Hash-based Algorithms
• Hash aggregate
• Hash map based on binary formats
• Auto-detect hash set mode (SELECT DISTINCT)
• Fall back to sort aggregation in a zero-copy way
• Hash join
• Mainly borrowed from Flink’s original MutableHashTable
• Changed from pull-based to push-based
• Supports inner/left/right/full/semi/anti joins
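The idea of a hash aggregate that falls back to sort aggregation can be sketched as follows. This is a toy model with hypothetical names: the memory budget is expressed as `max_groups`, partial results are "spilled" to a Python list, and the zero-copy aspect of the real fallback is not modelled.

```python
from itertools import groupby

def hash_aggregate_with_fallback(records, max_groups):
    """SUM(value) GROUP BY key, using a hash table limited to max_groups entries."""
    table = {}
    spilled = []  # partial (key, sum) pairs flushed when the table is full
    for key, value in records:
        if key in table:
            table[key] += value
        elif len(table) < max_groups:
            table[key] = value
        else:
            # Out of budget: flush the partial aggregates and start over.
            spilled.extend(table.items())
            table = {key: value}
    if not spilled:
        return dict(table)  # everything fit: pure hash aggregation
    # Fallback: sort the partial aggregates and merge adjacent equal keys.
    spilled.extend(table.items())
    spilled.sort(key=lambda kv: kv[0])
    return {
        key: sum(v for _, v in group)
        for key, group in groupby(spilled, key=lambda kv: kv[0])
    }

records = [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("b", 5)]
in_memory = hash_aggregate_with_fallback(records, max_groups=10)
with_fallback = hash_aggregate_with_fallback(records, max_groups=2)
```

Both paths produce the same result; the fallback only changes how much has to be held in the hash table at once.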
26. More Functionality Coverage
• More data types
• Support for all types of joins
• Subquery decorrelation
• Over window enhancements
• Full TPC-H support
27. Summary & Future
• Flink took a big step towards a truly unified architecture
• Blink planner is a state-of-the-art query processor for both batch & streaming
• Future (Flink 1.10+)
• Finalize the new type system
• Finalize the Blink merge
• Full TPC-DS support
• Feedback is welcome, to help us improve