<p>In this talk, we will highlight major efforts happening in the Spark ecosystem. In particular, we will dive into the details of adaptive and static query optimizations in Spark 3.0 that make Spark easier to use and faster to run. We will also demonstrate how new features in Koalas, an open source library that provides a pandas-like API on top of Spark, help data scientists gain insights from their data more quickly.</p>
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, and Koalas
1. Urs Hölzle
@uhoelzle
@JeffDean @GCPcloud R.I.P. MapReduce. After having
served us well since 2003, today we removed the
remaining internal codebase for good.
MR was a seminal idea in 2003 but we've learned a lot
since then. [There are new systems that] express
pipelines more naturally with less code, and you get
both batch and streaming from the same code.
2019
2. ❌
Separate Compute & Storage
AI & More than SQL
Open Source at Scale
Data Warehouse
Hadoop M/R
❌
❌
✔
✔
✔
4. SQL & Optimization
Data Model & Catalog
ACID Transactions
✔
✔
✔
❌
❌
❌
❌
Separate Compute & Storage
More than SQL (i.e., ML)
Open Source at Scale
Data Warehouse
Hadoop M/R
❌
❌
✔
✔
✔
✔
✔
✔
✔
Data Science at Scale ❌❌
✔
3.0
5. The Growing Apache Spark Ecosystem
3.0 Improved Optimizer and Catalog
ACID Transactions
Bringing Spark's Scale to pandas
21. Challenges of the Data Lake
[Diagram: Events flow into a Data Lake that feeds Streaming Analytics and AI & Reporting. Numbered callouts mark four challenges: λ-arch, Validation, Reprocessing, Updates. Additional labels: Partitioned; UPDATE & MERGE; Scheduled to Avoid Modifications.]
22. The Architecture
[Diagram: Kinesis and Data Lake (CSV, JSON, TXT…) inputs flow through Bronze, Silver, and Gold tables to Streaming Analytics and AI & Reporting.]
Data Quality Levels
• Bronze: Raw Ingestion
• Silver: Filtered, Cleaned, Augmented
• Gold: Business-level Aggregates
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
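As a rough illustration of the Bronze, Silver, Gold progression, the sketch below uses plain Python lists in place of Delta tables. The record fields (`user`, `amount`) and the function names `to_silver` and `to_gold` are hypothetical; a real pipeline would read and write Delta tables with Spark.

```python
from collections import defaultdict

# Bronze: raw ingested records, kept exactly as they arrived (hypothetical fields).
bronze = [
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": "not-a-number"},  # malformed value
    {"user": "a", "amount": "4.5"},
    {"user": None, "amount": "3.0"},          # missing user
]

def to_silver(rows):
    """Filter and clean: drop rows without a user or with an unparsable amount."""
    cleaned = []
    for row in rows:
        if row["user"] is None:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        cleaned.append({"user": row["user"], "amount": amount})
    return cleaned

def to_gold(rows):
    """Business-level aggregate: total amount per user."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each stage reads only from the previous one, so a fix to the cleaning rules can be applied by rebuilding Silver and Gold from the untouched Bronze data.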
23. The Architecture
Full ACID Transactions
Focus on your data flow,
instead of worrying about failures.
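One way to picture how a transaction log can serialize concurrent writers is the toy model below: each commit is a numbered JSON file, and only one writer can create a given version. This is a simplified sketch under stated assumptions, not Delta Lake's implementation; the `ToyLog` class, its methods, and the action format are hypothetical.

```python
import json
import os
import tempfile

class ToyLog:
    """Toy append-only commit log: version N lives in a zero-padded N.json file."""

    def __init__(self, directory):
        self.directory = directory

    def latest_version(self):
        versions = [int(name.split(".")[0]) for name in os.listdir(self.directory)
                    if name.endswith(".json")]
        return max(versions, default=-1)

    def commit(self, version, actions):
        """Try to write `version`; return False if another writer got there first."""
        path = os.path.join(self.directory, f"{version:020d}.json")
        try:
            # Mode "x" fails if the file already exists, so one writer wins a version.
            with open(path, "x") as f:
                json.dump(actions, f)
            return True
        except FileExistsError:
            return False

    def snapshot(self):
        """Replay commits in version order to get the current set of data files."""
        files = set()
        for name in sorted(os.listdir(self.directory)):
            with open(os.path.join(self.directory, name)) as f:
                for action in json.load(f):
                    if action["op"] == "add":
                        files.add(action["file"])
                    elif action["op"] == "remove":
                        files.discard(action["file"])
        return files

with tempfile.TemporaryDirectory() as tmp:
    log = ToyLog(tmp)
    first = log.commit(0, [{"op": "add", "file": "part-0.parquet"}])
    # A second writer that also saw an empty log tries version 0 and loses.
    conflict = log.commit(0, [{"op": "add", "file": "part-x.parquet"}])
    # It retries on the next free version instead.
    retry = log.commit(log.latest_version() + 1,
                       [{"op": "add", "file": "part-x.parquet"}])
    result = log.snapshot()
    print(first, conflict, retry, sorted(result))
```

Readers only ever see files that a successful commit has listed, which is why a failed job leaves no partial results visible in this model.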
24. The Architecture
Powered by
Unifies streaming / batch.
Convert existing jobs with minimal modifications.
25. The Architecture
Streams move data through the Delta Lake
• Low-latency or manually triggered
• Eliminates management of schedules and jobs
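Moving data incrementally, whether continuously or on a manual trigger, can be sketched with a checkpointed offset: each run processes only the records that arrived since the previous run. This is plain Python standing in for a streaming engine; `run_once`, the list-based tables, and the checkpoint dictionary are hypothetical.

```python
def run_once(source, sink, checkpoint, transform):
    """Process records added to `source` since the last checkpoint, then advance it."""
    start = checkpoint["offset"]
    new_records = source[start:]
    sink.extend(transform(record) for record in new_records)
    checkpoint["offset"] = start + len(new_records)
    return len(new_records)

def clean(name):
    """Example transformation: trim whitespace and lowercase."""
    return name.strip().lower()

raw_table = ["  alice ", "BOB"]
clean_table = []
checkpoint = {"offset": 0}

processed_first = run_once(raw_table, clean_table, checkpoint, clean)

# New data arrives; the next trigger picks up only the new record.
raw_table.append(" Carol")
processed_second = run_once(raw_table, clean_table, checkpoint, clean)

print(processed_first, processed_second, clean_table)
```

Because the checkpoint records progress, the same function serves a continuously running loop and a manually triggered run without a separate schedule to track what has been processed.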
26. The Architecture
Delta Lake also supports batch jobs and standard DML
INSERT, UPDATE, DELETE, MERGE, OVERWRITE
• Retention
• Corrections
• Change data capture
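The upsert behaviour of MERGE, as used for corrections and change data capture, can be sketched on a dictionary keyed by primary key: matched keys are updated, unmatched keys are inserted, and rows flagged as deleted are removed. This is plain Python rather than Delta Lake SQL; the `merge` function and the `deleted` flag are assumptions made for illustration.

```python
def merge(target, changes, key="id"):
    """Apply a batch of change rows to `target` (a dict keyed by `key`)."""
    result = dict(target)  # leave the input untouched
    for change in changes:
        k = change[key]
        if change.get("deleted"):
            result.pop(k, None)                  # matched and flagged: delete
        elif k in result:
            result[k] = {**result[k], **change}  # matched: update
        else:
            result[k] = dict(change)             # not matched: insert
    return result

customers = {
    1: {"id": 1, "city": "Oslo"},
    2: {"id": 2, "city": "Lima"},
}
changes = [
    {"id": 1, "city": "Bergen"},   # correction to an existing row
    {"id": 2, "deleted": True},    # retention: remove the row
    {"id": 3, "city": "Quito"},    # new row
]

merged = merge(customers, changes)
print(merged)
```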
27. Delta Lake Community
~2+ Exabytes of Delta Read/Writes
3700+ Orgs using Delta
[Chart: Downloads by month, March through September; y-axis from 0 to 20,000]
32.
# pandas: runs on a single machine
import pandas as pd
df = pd.read_csv('my_data.csv')
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
This works great on my laptop… but what if I have more data?
# Koalas: the same pandas-style API, executed on Spark
import databricks.koalas as ks
df = ks.read_delta('/lake/data')
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
35. How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas
• Challenge: increasing scale and complexity of data operations
• Struggling with the “Spark switch” from pandas
• More than 10X faster with less than 1% code changes
36. Getting Started with Koalas
Docs and updates on github.com/databricks/koalas
Project docs are published on koalas.readthedocs.io
pip install koalas OR conda install koalas