Spark 3.0 will include several new features including resolving over 2000 issues, a data source API with catalog support, adaptive query execution, dynamic partition pruning, ANSI SQL compliance, and running on Kubernetes. It will also improve query optimization using rules, costs, and runtime information. Koalas enables working with big data in Python using pandas APIs on Spark DataFrames. Delta Lake provides a standard for building data lakes using a transactional storage layer based on Parquet that ensures data reliability and supports both batch and streaming workloads.
10. Dynamic Partition Pruning
Project
Join
Filter
Scan Scan
t1: a large fact table
with many partitions
t2.id < 2
t2: a dimension
table with a filter
SELECT t1.id, t2.pKey
FROM t1
JOIN t2
ON t1.pKey = t2.pKey
AND t2.id < 2
t1.pKey = t2.pKey
11. Project
Join
Filter + Scan Scan
t1.pKey = t2.pKey
Scan all the partitions
t2.id < 2
Filter pushdown
Filter
t1.pkey IN (
SELECT t2.pKey
FROM t2
WHERE t2.id < 2)
SELECT t1.id, t2.pKey
FROM t1
JOIN t2
ON t1.pKey = t2.pKey
AND t2.id < 2
Dynamic Partition Pruning
12. Project
Join
Filter + Scan Filter + Scan
Scan the affected partitions
t2.id < 2
Filter pushdown
t1.pKey in DPPFilterResult
Speed: 33 X faster Scan: 90+% less
Dynamic Partition Pruning
t1.pKey = t2.pKey
18. 18
BUSINESS INTELLIGENCE BIG DATA ANALYTICS UNIFYING DATA + AI
Autonomous Decisions
Advanced ML / AI - Make the
decision for me
Spark as a unifying technology -
Data processing at scale +
analytics + AI
Data Warehouses
Reactive Analytics - What just
happened?
Predictive Analytics
Proactive Analytics - Predict
what’s going to happen
33. 33
Databricks Cluster
Data consistency and
integrity:
not available before
Increased data quality:
name match accuracy up
80% → 95%
Faster data loads:
一 天 → 二十分钟