New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, and Koalas

Urs Hölzle
@uhoelzle
@JeffDean @GCPcloud R.I.P. MapReduce. After having
served us well since 2003, today we removed the
remaining internal codebase for good.
MR was a seminal idea in 2003 but we've learned a lot
since then. [There are new systems that] express
pipelines more naturally with less code, and you get
both batch and streaming from the same code.
2019

                              Data Warehouse   Hadoop M/R
Separate Compute & Storage          ❌              ✔
AI & More than SQL                  ❌              ✔
Open Source at Scale                ❌              ✔

Traditional RDBMS Opinion 2008

[Comparison table: Data Warehouse vs. Hadoop M/R vs. Spark 3.0, marked ✔ or ❌ per criterion]
Criteria: SQL & Optimization; Data Model & Catalog; ACID Transactions; Separate Compute & Storage; More than SQL (i.e., ML); Open Source at Scale; Data Science at Scale

The Growing Apache Spark Ecosystem
• Spark 3.0: Improved Optimizer and Catalog
• Delta Lake: ACID Transactions
• Koalas: Bringing Spark's Scale to Pandas

Spark 3.0: Improved Optimizer and Catalog

Spark 3.0: Pluggable Data Catalog
DataSourceV2
• Pluggable catalog integration
• Improved pushdown
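• Unified APIs for streaming and batch
// Scala; the $"..." column syntax needs import spark.implicits._
df.writeTo("catalog.db.table")
  .overwrite($"year" === "2019")  // replace only the rows where year is 2019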
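A toy model of what the `overwrite` call above expresses: rows in the table that match the condition are replaced by the contents of the DataFrame. This is plain Python, not Spark; `overwrite_where` and the sample rows are made up for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def overwrite_where(table: List[Row],
                    new_rows: List[Row],
                    predicate: Callable[[Row], bool]) -> List[Row]:
    """Drop the rows that match `predicate`, then append `new_rows`."""
    kept = [row for row in table if not predicate(row)]
    return kept + list(new_rows)


# Hypothetical table contents.
table = [
    {"year": "2018", "sales": 10},
    {"year": "2019", "sales": 20},
    {"year": "2019", "sales": 30},
]
fresh_2019 = [{"year": "2019", "sales": 55}]

# Only the 2019 rows are replaced; the 2018 row is left alone.
result = overwrite_where(table, fresh_2019, lambda row: row["year"] == "2019")
print(result)
```

Rows outside the condition are untouched, which is what distinguishes this from overwriting the whole table.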

Spark 3.0: Adaptive Query Execution
Make better optimization decisions during query execution.
[Diagram: Sort → Join is replaced by Broadcast → Join. No expensive Sort!]
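The idea on this slide can be sketched as a decision that is made twice: once from the planner's estimates, and again from sizes observed while the query runs. The code below is a simplified model in plain Python, not Spark's optimizer; `JoinSide`, `choose_join`, and the 10 MiB cutoff are assumptions for this sketch.

```python
from dataclasses import dataclass

# Hypothetical cutoff for this sketch only (10 MiB).
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass
class JoinSide:
    name: str
    estimated_bytes: int  # planner's guess before the query runs
    observed_bytes: int   # size measured after the upstream stage finished


def choose_join(left: JoinSide, right: JoinSide, adaptive: bool) -> str:
    """Pick a join strategy from estimates (static) or observed sizes (adaptive)."""
    if adaptive:
        smaller = min(left.observed_bytes, right.observed_bytes)
    else:
        smaller = min(left.estimated_bytes, right.estimated_bytes)
    return "broadcast" if smaller <= BROADCAST_THRESHOLD_BYTES else "sort_merge"


orders = JoinSide("orders", estimated_bytes=50 * 1024**3, observed_bytes=50 * 1024**3)
# A selective filter shrank this side far below what the planner estimated.
customers = JoinSide("customers", estimated_bytes=500 * 1024**2, observed_bytes=2 * 1024**2)

static_plan = choose_join(orders, customers, adaptive=False)
adaptive_plan = choose_join(orders, customers, adaptive=True)
print(static_plan, adaptive_plan)
```

With estimates alone the sketch falls back to a sort-merge join; once the observed size is known, the small side can be broadcast and the sort is skipped.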

Spark 3.0: Powerful Optimization
Dynamic partition pruning speeds up expensive joins.
[Diagram: query plans built from Scan, Filter, and Join operators]
Talk later today!
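A minimal sketch of the pruning idea, assuming a fact table partitioned by the join key and a small dimension table with a selective filter. It is plain Python rather than Spark, and the table names, columns, and `pruned_join` helper are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]

# Fact table stored as one list of rows per partition (partitioned by date_id).
sales_partitions: Dict[int, List[Row]] = {
    1: [{"date_id": 1, "amount": 10}],
    2: [{"date_id": 2, "amount": 20}, {"date_id": 2, "amount": 5}],
    3: [{"date_id": 3, "amount": 30}],
    4: [{"date_id": 4, "amount": 40}],
}
dates: List[Row] = [
    {"date_id": 1, "quarter": "Q1"},
    {"date_id": 2, "quarter": "Q2"},
    {"date_id": 3, "quarter": "Q3"},
    {"date_id": 4, "quarter": "Q2"},
]


def pruned_join(fact: Dict[int, List[Row]],
                dim: List[Row],
                dim_filter: Callable[[Row], bool]) -> Tuple[List[int], List[Row]]:
    # 1. Filter the small dimension side first.
    dim_by_key = {row["date_id"]: row for row in dim if dim_filter(row)}
    # 2. Use its surviving keys to skip fact partitions that cannot match.
    scanned = [key for key in fact if key in dim_by_key]
    # 3. Join only the rows of the partitions that were scanned.
    joined = [{**fact_row, **dim_by_key[key]}
              for key in scanned
              for fact_row in fact[key]]
    return scanned, joined


scanned, joined = pruned_join(sales_partitions, dates,
                              lambda row: row["quarter"] == "Q2")
print(scanned, sum(row["amount"] for row in joined))
```

Here only two of the four fact partitions are read, because the filter on the dimension side is applied before the fact-table scan.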

World Class Performance for Warehousing
Spark 3.0 Improves TPC-DS Performance by as much as 17x!
Spark wins TPC-DS performance top spot!
[Chart: SpeedUp (axis 0 to 20) for bars labeled 25-v2.4, 17-v2.4, 15-v2.4, 42-v2.4, 6-v2.4, 58-v2.4, 56-v2.4, 54-v2.4, 71-v2.4, 33-v2.4, 60-v2.4, 55-v2.4, 52-v2.4]

And Much More…
3.0 PREVIEW COMING SOON!

Evolution of a Cutting-Edge Data Lake
[Diagram: Events, ???, Streaming Analytics, AI & Reporting]

Easy, right?
[Diagram: Events, Streaming Analytics, AI & Reporting]

Challenge #1: Historical Queries?
[Diagram: Events, Data Lake, Streaming Analytics, AI & Reporting; callout 1: λ-arch]

Challenge #2: Messy Data?
[Diagram: Events, Data Lake, Streaming Analytics, AI & Reporting; callouts 1: λ-arch, 2: Validation]

Challenge #3: Mistakes and Failures?
[Diagram: Events, Data Lake, Streaming Analytics, AI & Reporting, Partitioned; callouts 1: λ-arch, 2: Validation, 3: Reprocessing]

Challenge #4: Updates?
[Diagram: Events, Data Lake, Streaming Analytics, AI & Reporting, Partitioned, UPDATE & MERGE, Scheduled to Avoid Modifications; callouts 1: λ-arch, 2: Validation, 3: Reprocessing, 4: Updates]

Wasting Time & Money
Solving Systems Problems
Instead of Extracting Value From Data

Challenges of the Data Lake
[Diagram: Events, Data Lake, Streaming Analytics, AI & Reporting, Partitioned, UPDATE & MERGE, Scheduled to Avoid Modifications; callouts 1: λ-arch, 2: Validation, 3: Reprocessing, 4: Updates]

The Architecture
Data Quality Levels:
• Bronze: Raw Ingestion
• Silver: Filtered, Cleaned, Augmented
• Gold: Business-level Aggregates
Sources: CSV, JSON, TXT…, Kinesis, Data Lake
Consumers: Streaming Analytics, AI & Reporting
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
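The Bronze, Silver, Gold progression can be sketched as three small steps, each one raising data quality. This is a plain-Python illustration under made-up assumptions (the event shape, the validation rule, and the function names are hypothetical); it does not use Delta Lake itself.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Hypothetical raw input: some lines are malformed or invalid on purpose.
raw_lines = [
    '{"user": "a", "amount": 5}',
    'not json at all',
    '{"user": "b", "amount": -1}',
    '{"user": "a", "amount": 7}',
    '{"user": "c"}',
]


def to_bronze(lines: List[str]) -> List[dict]:
    """Bronze: keep every record exactly as ingested."""
    return [{"raw": line} for line in lines]


def to_silver(bronze: List[dict]) -> List[dict]:
    """Silver: parse each record, then drop malformed or invalid ones."""
    clean = []
    for record in bronze:
        try:
            event = json.loads(record["raw"])
        except json.JSONDecodeError:
            continue  # unparseable line
        if not isinstance(event, dict):
            continue
        amount = event.get("amount")
        if "user" in event and isinstance(amount, (int, float)) and amount >= 0:
            clean.append({"user": event["user"], "amount": amount})
    return clean


def to_gold(silver: List[dict]) -> Dict[str, int]:
    """Gold: business-level aggregate, here the total amount per user."""
    totals: Dict[str, int] = defaultdict(int)
    for event in silver:
        totals[event["user"]] += event["amount"]
    return dict(totals)


bronze = to_bronze(raw_lines)
silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), gold)
```

Keeping the raw records in the Bronze step means the later steps can be rerun with a different validation rule without ingesting the data again.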

The Architecture
[Diagram: sources (CSV, JSON, TXT…, Kinesis, Data Lake) → Bronze (Raw Ingestion) → Silver (Filtered, Cleaned, Augmented) → Gold (Business-level Aggregates) → Streaming Analytics, AI & Reporting]
Full ACID Transactions
Focus on your data flow, instead of worrying about failures.

The Architecture
[Diagram: sources (CSV, JSON, TXT…, Kinesis, Data Lake) → Bronze (Raw Ingestion) → Silver (Filtered, Cleaned, Augmented) → Gold (Business-level Aggregates) → Streaming Analytics, AI & Reporting]
Powered by Apache Spark
• Unifies streaming / batch.
• Convert existing jobs with minimal modifications.

The Architecture
[Diagram: sources (CSV, JSON, TXT…, Kinesis, Data Lake) → Bronze (Raw Ingestion) → Silver (Filtered, Cleaned, Augmented) → Gold (Business-level Aggregates) → Streaming Analytics, AI & Reporting]
Streams move data through the Delta Lake
• Low-latency or manually triggered
• Eliminates management of schedules and jobs

The Architecture
[Diagram: sources (CSV, JSON, TXT…, Kinesis, Data Lake) → Bronze (Raw Ingestion) → Silver (Filtered, Cleaned, Augmented) → Gold (Business-level Aggregates) → Streaming Analytics, AI & Reporting]
Delta Lake also supports batch jobs and standard DML
Operations: INSERT, UPDATE, DELETE, MERGE, OVERWRITE
• Retention
• Corrections
• Change data capture
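To make the MERGE and change-data-capture use case concrete, here is a toy upsert in plain Python: matched rows are updated or deleted, unmatched rows are inserted. It only models the semantics; the `merge` helper, the `deleted` flag, and the sample rows are assumptions for this sketch, not Delta Lake's API.

```python
from typing import Dict, List

Row = Dict[str, object]


def merge(target: List[Row], changes: List[Row], key: str) -> List[Row]:
    """Apply changes to target: delete flagged rows, update matches, insert the rest."""
    by_key = {row[key]: dict(row) for row in target}
    for change in changes:
        if change.get("deleted"):
            by_key.pop(change[key], None)        # DELETE when matched
        elif change[key] in by_key:
            by_key[change[key]].update(change)   # UPDATE when matched
        else:
            by_key[change[key]] = dict(change)   # INSERT when not matched
    return list(by_key.values())


# Hypothetical target table and change feed.
customers = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Rome"},
    {"id": 3, "city": "Lima"},
]
change_feed = [
    {"id": 2, "city": "Milan"},   # correction to an existing row
    {"id": 3, "deleted": True},   # retention: remove this row
    {"id": 4, "city": "Kyoto"},   # new row
]

merged = merge(customers, change_feed, key="id")
print(merged)
```

The three branches correspond to the Corrections, Retention, and Change data capture bullets: one pass over the change feed handles all of them.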

Delta Lake Community
• ~2+ Exabytes of Delta Read/Writes
• 3700+ Orgs using Delta
[Chart: Downloads (axis 0 to 20,000) by month, March through September]

Announcing:
Delta Lake Joins the Linux Foundation!

Bringing the Power of
Apache Spark to Pandas

# pandas: runs on a single machine
import pandas as pd
df = pd.read_csv('my_data.csv')
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
This works great on my laptop… but what if I have more data?
# Koalas: the same pandas-style code, executed on Apache Spark
import databricks.koalas as ks
df = ks.read_delta('/lake/data')
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x

Growing Koalas Ecosystem
• 10,000+ downloads per day
• 204,452 downloads this Sept
• ~100% month-over-month download growth
• 21 bi-weekly releases

How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas
• Challenge: increasing scale and complexity of data operations
• Struggling with the “Spark switch” from pandas
• More than 10X faster with less than 1% code changes

Getting Started with Koalas
Docs and updates on github.com/databricks/koalas
Project docs are published on koalas.readthedocs.io
pip install koalas OR conda install koalas

The Spark Ecosystem is Exploding
Bringing the best characteristics of the Data Lake and
Traditional Relational Databases together:
Tomorrow:
