July 16, 2020
Spark + AI Summit 2020
(recap)
68,000+ registrations from 100+ countries!
https://databricks.com/sparkaisummit/north-america-2020/agenda
Agenda
What's new in Spark 3.0
Spark has just turned 10 years old; let's look at some of the enhancements in the latest release
Managing the ML lifecycle
MLFlow updates
The Lakehouse paradigm
One platform for data warehousing and data
science
Who are we?
Patricia Ferreiro Alonso
patricia.ferreiro@databricks.com
Dr. Piotr Majer
piotr.majer@databricks.com
Guido Oswald
guido@databricks.com / @guidooswald
What’s new in Spark 3.0
Major Lessons
1. Focus on ease of use in both exploration and production
2. APIs to enable software best practices: composition, testability, modularity
3400+ Resolved
JIRAs
in Spark 3.0 rc2
Adaptive Query
Execution
Dynamic Partition
Pruning
Query Compilation
Speedup
Join Hints
Performance
Richer APIs
Accelerator-aware
Scheduler
Built-in
Functions
pandas UDF
Enhancements
DELETE/UPDATE/
MERGE in Catalyst
Reserved
Keywords
Proleptic Gregorian
Calendar
ANSI Store
Assignment
Overflow
Checking
SQL
Compatibility
Built-in Data
Sources
Parquet/ORC Nested
Column Pruning
Parquet: Nested Column
Filter Pushdown
CSV Filter
Pushdown
New Binary
Data Source
Data Source V2 API +
Catalog Support
JDK 11 Support
Hadoop 3
Support
Hive 3.x Metastore
Hive 2.3 Execution
Extensibility and Ecosystem
Structured
Streaming UI
DDL/DML
Enhancements
Observable
Metrics
Event Log
Rollover
Monitoring and Debuggability
Performance
Achieve high performance for interactive, batch, streaming and ML workloads
Adaptive Query
Execution
Dynamic Partition
Pruning
Query Compilation
Speedup
Join Hints
Spark Catalyst Optimizer
Spark 1.x, Rule
Spark 2.x, Rule + Cost
Query Optimization in Spark 2.x
▪ Missing statistics - statistics collection is expensive
▪ Out-of-date statistics - compute and storage are separated
▪ Suboptimal heuristics - local only
▪ Misestimated costs - complex environments, user-defined functions
Spark Catalyst Optimizer
Spark 1.x, Rule
Spark 2.x, Rule + Cost
Spark 3.0, Rule + Cost + Runtime
adaptive planning
Based on statistics of the finished plan nodes, re-optimize the execution plan of the remaining queries
Adaptive Query Execution
Blog post:
https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
Based on statistics of the finished plan nodes, re-optimize
the execution plan of the remaining queries
• Convert Sort Merge Join to Broadcast Hash Join
• Shrink the number of reducers
• Handle skew join
Adaptive Query Execution
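As a hedged illustration of the feature above (not from the slides): in Spark 3.0, AQE is controlled by the spark.sql.adaptive.* configs and is disabled by default. A minimal PySpark sketch, assuming an existing SparkSession named spark:

# Turn on Adaptive Query Execution (off by default in Spark 3.0).
spark.conf.set("spark.sql.adaptive.enabled", "true")
# With AQE on, shuffle statistics collected at stage boundaries are used to
# re-optimize the remaining plan: join strategy, partition coalescing, skew handling.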
Convert Sort Merge Join to Broadcast Hash Join
[Diagram, before execution: the two join inputs are estimated at 100 MB and 300 MB, so the planner chooses a Sort Merge Join (Scan -> Shuffle -> Sort on each side, stages 1 and 2).]
[Diagram, after executing stages 1 and 2: the actual sizes are 86 MB and 8 MB, so AQE re-optimizes the remaining plan into a Broadcast Hash Join and broadcasts the small side.]
Dynamically Coalesce Shuffle Partitions
▪ Set the initial partition number high to accommodate the largest data size of the entire query execution
▪ Automatically coalesce partitions if needed after each query stage
[Diagram: a stage of Scan -> Filter -> Shuffle (50 partitions) -> Sort; at runtime AQE detects 50 small partitions and coalesces them into 5.]
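A minimal sketch of the related Spark 3.0 configs (assuming AQE is enabled and an existing SparkSession named spark; the values shown are illustrative, not necessarily the defaults):

# Coalesce small post-shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Start with a high shuffle partition number to cover the largest stage...
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")
# ...and let AQE merge partitions toward this advisory target size.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")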
Data Skew in Sort Merge Join
[Diagram: partition A0 of table A is far larger than partitions A1-A3, so the task that joins A0 with B0 dominates the runtime of the Sort Merge Join.]
Dynamically Optimize Skew Joins
[Diagram: AQE splits the skewed partition A0 into A0-S0, A0-S1 and A0-S2 and duplicates B0 for each split, balancing the join work across tasks.]
Adaptive Query Execution
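A minimal sketch of the Spark 3.0 skew-join configs (assuming AQE is enabled and an existing SparkSession named spark; the values shown are illustrative):

# Split skewed shuffle partitions and replicate the matching partitions on the other side.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# A partition is treated as skewed when it is both skewedPartitionFactor times larger
# than the median partition size and larger than this byte threshold.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "10")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")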
Performance
Achieve high performance for interactive, batch, streaming and ML
workloads
Adaptive Query
Execution
Dynamic Partition
Pruning
Join Hints
Query Compilation
Speedup
Dynamic Partition Pruning
• Avoid scanning partitions based on the query results of other query fragments.
• Important for star-schema queries.
• Significant speedup in TPC-DS.
Example: t1 is a large fact table with many partitions, t2 is a dimension table with a filter.

SELECT t1.id, t2.pKey
FROM t1
JOIN t2
  ON t1.pKey = t2.pKey
 AND t2.id < 2

[Query-plan diagrams: Project over a Join on t1.pKey = t2.pKey, with a plain Scan of t1 on one side and Filter (t2.id < 2) + Scan of t2 on the other.]

• Without pruning, all partitions of t1 are scanned, while the filter t2.id < 2 is pushed down only on the t2 side.
• A static filter pushdown would rewrite the t1 scan as t1.pKey IN (SELECT t2.pKey FROM t2 WHERE t2.id < 2), scanning only the required partitions of t1, but at the cost of evaluating the subquery on t2 separately.
• With Dynamic Partition Pruning, the broadcast hash join keys of t2 (the DPPFilterResult) are reused to scan only the required partitions of t1, so each table is read just once.

Result: 90+% less file scan, 33x faster.
Dynamic Partition Pruning
60 / 102 TPC-DS queries: a speedup between 2x and 18x
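A hedged PySpark sketch of the same query shape (assuming an existing SparkSession named spark and that t1 is partitioned by pKey); DPP is on by default in 3.0:

spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Star-schema shaped query: the filter on the dimension table t2 becomes a
# partition filter on the fact table t1 at runtime.
spark.sql("""
    SELECT t1.id, t2.pKey
    FROM t1
    JOIN t2 ON t1.pKey = t2.pKey AND t2.id < 2
""").explain()   # look for a dynamic pruning expression in the partition filters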
Performance
Achieve high performance for interactive, batch, streaming and ML workloads
Adaptive Query
Execution
Dynamic Partition
Pruning
Join Hints
Query Compilation
Speedup
Optimizer Hints
▪ Join hints influence the optimizer to choose a particular join strategy:
  ▪ Broadcast hash join
  ▪ Sort-merge join (NEW)
  ▪ Shuffle hash join (NEW)
  ▪ Shuffle nested loop join (NEW)
▪ Should be used with extreme caution.
▪ Difficult to manage over time.
▪ Broadcast Hash Join
SELECT /*+ BROADCAST(a) */ id FROM a JOIN b ON a.key = b.key
▪ Sort-Merge Join
SELECT /*+ MERGE(a, b) */ id FROM a JOIN b ON a.key = b.key
▪ Shuffle Hash Join
SELECT /*+ SHUFFLE_HASH(a, b) */ id FROM a JOIN b ON a.key = b.key
▪ Shuffle Nested Loop Join
SELECT /*+ SHUFFLE_REPLICATE_NL(a, b) */ id FROM a JOIN b
How to Use Join Hints?
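The same hints can also be given through the DataFrame API. A minimal sketch (a and b are hypothetical DataFrames sharing a key column):

from pyspark.sql.functions import broadcast

a.join(broadcast(b), "key")                    # broadcast hash join
a.join(b.hint("merge"), "key")                 # sort-merge join
a.join(b.hint("shuffle_hash"), "key")          # shuffle hash join
a.join(b.hint("shuffle_replicate_nl"), "key")  # shuffle-and-replicate nested loop join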
3.0: SQL Compatibility
ANSI Reserved
Keywords
ANSI Gregorian
Calendar
ANSI Store
Assignment
ANSI Overflow
Checking
ANSI SQL language dialect and broad support
3.0: Python Usability
Python type hints for pandas UDFs (replacing the old PandasUDFType-based API)
3.0: Other Features
Structured Streaming UI
Observable stream metrics
SQL reference guide
Data Source v2 API
GPU scheduling support
...
Enable new use cases and simplify the Spark application development
Accelerator-aware
Scheduler
Built-in
Functions
pandas UDF
enhancements
DELETE/UPDATE/
MERGE in Catalyst
Richer APIs
Accelerator-aware Scheduling
▪ Widely used for accelerating specialized workloads, e.g., deep learning and signal processing.
▪ Supported on Standalone, YARN and K8S.
▪ Supports GPUs now; FPGAs, TPUs, etc. in the future.
▪ Required resources are specified via configs (see the sketch below).
▪ Application level for now; job/stage/task level will be supported in the future.
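A minimal PySpark sketch of the Spark 3.0 resource configs (the discovery-script path is a placeholder; a sample getGpusResources.sh ships with the Spark distribution):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # One GPU per executor, a quarter of a GPU per task (so four tasks share each GPU).
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    # Script that reports the GPU addresses available to an executor.
    .config("spark.executor.resource.gpu.discoveryScript", "/path/to/getGpusResources.sh")
    .getOrCreate()
)

# Inside a task, the assigned GPU addresses are exposed via TaskContext:
# from pyspark import TaskContext
# gpus = TaskContext.get().resources()["gpu"].addresses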
Enable new use cases and simplify the Spark application development
Richer APIs
Accelerator-aware
Scheduler
Built-in
Functions
pandas UDF
enhancements
DELETE/UPDATE
/MERGE in
Catalyst
32 New Built-in Functions
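A few of the 32 new SQL functions, as a hedged sketch (table and column names are hypothetical; assuming an existing SparkSession named spark):

spark.sql("SELECT make_date(2020, 7, 16)").show()                # build a date from year/month/day
spark.sql("SELECT count_if(val > 5) FROM tab1").show()           # count rows matching a predicate
spark.sql("SELECT max_by(name, salary) FROM employees").show()   # name at the maximum salary
spark.sql("SELECT bit_count(255)").show()                        # number of set bits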
Enable new use cases and simplify the Spark application development
Richer APIs
Accelerator-aware
Scheduler
Built-in
Functions
pandas UDF
enhancements
DELETE/UPDATE/
MERGE in Catalyst
The evolution of Python UDF support in Spark:
V 0.7 (2013): Python lambda functions for RDDs
V 1.2 (2014): Python UDF for SQL
V 2.0 (2016): Session-specific Python UDF
V 2.1 (2017): Java UDF in Python API
V 2.3/2.4 (2018): New Pandas UDF
V 3.0 (2019/2020): Python Type Hints
https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html
Scalar Pandas UDF
[pandas.Series to pandas.Series]
SPARK 2.3 SPARK 3.0
Python Type Hints
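The code on this slide was shown as an image; a hedged reconstruction of the two styles (df is an assumed DataFrame with a long column "v"):

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Spark 2.3 style: the UDF type is passed explicitly.
# from pyspark.sql.functions import PandasUDFType
# @pandas_udf("long", PandasUDFType.SCALAR)
# def plus_one(v):
#     return v + 1

# Spark 3.0 style: the UDF type is inferred from Python type hints.
@pandas_udf("long")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

df.select(plus_one("v")).show()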
Grouped Map Pandas Function API
[pandas.DataFrame to pandas.DataFrame]
SPARK 2.3 SPARK 3.0
Python Type Hints
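A hedged reconstruction of the before/after shown on this slide (df is an assumed DataFrame with columns id: long and v: double):

import pandas as pd

# Spark 2.3 style: a GROUPED_MAP pandas UDF applied with groupby(...).apply(...).
# @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
# def subtract_mean(pdf):
#     return pdf.assign(v=pdf.v - pdf.v.mean())
# df.groupby("id").apply(subtract_mean)

# Spark 3.0 style: a plain Python function plus the applyInPandas Function API.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()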
Grouped Aggregate Pandas UDF
[pandas.Series to Scalar]
SPARK 2.4 SPARK 3.0
Python Type Hints
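A hedged reconstruction (same assumed df with id and v columns):

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Spark 2.4 style:
# @pandas_udf("double", PandasUDFType.GROUPED_AGG)
# def mean_udf(v):
#     return v.mean()

# Spark 3.0 style: pandas.Series in, scalar out, inferred from the type hints.
@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    return v.mean()

df.groupby("id").agg(mean_udf(df["v"])).show()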
New Pandas UDF Types
Map Pandas UDF
Cogrouped Map Pandas UDF
New Pandas Function APIs
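A hedged sketch of the two new Function APIs (df, df1, df2 and their schemas are assumptions for illustration):

from typing import Iterator
import pandas as pd

# Map: an iterator of pandas DataFrames in, an iterator of pandas DataFrames out.
def keep_id_one(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in batches:
        yield pdf[pdf.id == 1]

df.mapInPandas(keep_id_one, schema=df.schema)

# Cogrouped map: two DataFrames grouped by the same key, merged one group at a time.
def asof_join(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    return pd.merge_asof(left, right, on="time", by="id")

df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
    asof_join, schema="time long, id int, v1 double, v2 string")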
Make monitoring and debugging Spark applications more comprehensive and stable
Structured
Streaming UI
DDL/DML
Enhancements
Observable
Metrics
Event Log
Rollover
Monitoring and Debuggability
New Command EXPLAIN FORMATTED
*(1) Project [key#5, val#6]
+- *(1) Filter (isnotnull(key#5) AND (key#5 = Subquery scalar-subquery#15, [id=#113]))
   :  +- Subquery scalar-subquery#15, [id=#113]
   :     +- *(2) HashAggregate(keys=[], functions=[max(key#21)])
   :        +- Exchange SinglePartition, true, [id=#109]
   :           +- *(1) HashAggregate(keys=[], functions=[partial_max(key#21)])
   :              +- *(1) Project [key#21]
   :                 +- *(1) Filter (isnotnull(val#22) AND (val#22 > 5))
   :                    +- *(1) ColumnarToRow
   :                       +- FileScan parquet default.tab2[key#21,val#22] Batched: true, DataFilters: [isnotnull(val#22), (val#22 > 5)],
Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/tab2], PartitionFilters: [], PushedFilters:
[IsNotNull(val), GreaterThan(val,5)], ReadSchema: struct<key:int,val:int>
   +- *(1) ColumnarToRow
      +- FileScan parquet default.tab1[key#5,val#6] Batched: true, DataFilters: [isnotnull(key#5)], Format: Parquet, Location:
InMemoryFileIndex[file:/user/hive/warehouse/tab1], PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema:
struct<key:int,val:int>
EXPLAIN FORMATTED
SELECT *
FROM tab1
WHERE key = (SELECT max(key)
FROM tab2
WHERE val > 5)
* Project (4)
+- * Filter (3)
   +- * ColumnarToRow (2)
      +- Scan parquet default.tab1 (1)
(1) Scan parquet default.tab1 
Output [2]: [key#5, val#6]
Batched: true
Location: InMemoryFileIndex [file:/user/hive/warehouse/tab1]
PushedFilters: [IsNotNull(key)]
ReadSchema: struct<key:int,val:int>
     
(2) ColumnarToRow [codegen id : 1]
Input [2]: [key#5, val#6]
(3) Filter [codegen id : 1]
Input [2]: [key#5, val#6]
Condition : (isnotnull(key#5) AND (key#5 = Subquery scalar-subquery#27, [id=#164]))
     
(4) Project [codegen id : 1]
Output [2]: [key#5, val#6]
Input [2]: [key#5, val#6]
EXPLAIN FORMATTED
SELECT *
FROM tab1
WHERE key = (SELECT max(key)
FROM tab2
WHERE val > 5)
(5) Scan parquet default.tab2 
Output [2]: [key#21, val#22]
Batched: true
Location: InMemoryFileIndex [file:/user/hive/warehouse/tab2]
PushedFilters: [IsNotNull(val), GreaterThan(val,5)]
ReadSchema: struct<key:int,val:int>
(6) ColumnarToRow [codegen id : 1]
Input [2]: [key#21, val#22]
(7) Filter [codegen id : 1]
Input [2]: [key#21, val#22]
Condition : (isnotnull(val#22) AND (val#22 > 5))
===== Subqueries =====
Subquery:1 Hosting operator id = 3 Hosting Expression = Subquery scalar-subquery#27, [id=#164]
* HashAggregate (11)
+- Exchange (10)
   +- * HashAggregate (9)
      +- * Project (8)
         +- * Filter (7)
            +- * ColumnarToRow (6)
               +- Scan parquet default.tab2 (5)
(8) Project [codegen id : 1]
Output [1]: [key#21]
Input [2]: [key#21, val#22]
(9) HashAggregate [codegen id : 1]
Input [1]: [key#21]
Keys: []
Functions [1]: [partial_max(key#21)]
Aggregate Attributes [1]: [max#35]
Results [1]: [max#36]
(10) Exchange 
Input [1]: [max#36]
Arguments: SinglePartition, true, [id=#160]
(11) HashAggregate [codegen id : 2]
Input [1]: [max#36]
Keys: []
Functions [1]: [max(key#21)]
Aggregate Attributes [1]: [max(key#21)#33]
Results [1]: [max(key#21)#33 AS max(key)#34]
DDL/DML
Enhancements
Make monitoring and debugging Spark applications more comprehensive and stable
Monitoring and Debuggability
Structured
Streaming
UI
Observable
Metrics
Event Log
Rollover
A flexible way to monitor data quality.
Observable Metrics
Reduce the time and complexity of enabling applications that were written for other
relational database products to run in Spark SQL.
Reserved
Keywords in Parser
Proleptic Gregorian
Calendar
ANSI Store
Assignment
Overflow
Checking
SQL Compatibility
A safer way to do table insertion and avoid bad data.
ANSI store assignment + overflow check
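A minimal sketch of the related Spark 3.0 configs (assuming an existing SparkSession named spark):

# ANSI store assignment policy for table insertion (runtime checks instead of silent coercion).
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
# ANSI mode: runtime overflow checking, ANSI reserved keywords in the parser, etc.
spark.conf.set("spark.sql.ansi.enabled", "true")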
Enhance the performance and functionalities of the built-in data sources
Parquet/ORC Nested
Column Pruning
Parquet: Nested Column
Filter Pushdown
New Binary
Data Source
CSV Filter
Pushdown
Built-in Data Sources
▪ Skip reading useless data blocks when only a few inner fields are
selected.
Better performance for nested fields
▪ Skip reading useless data blocks when there are predicates with
inner fields.
Better performance for nested fields
Improve the plug-in interface and extend the deployment environments
Data Source V2 API
+ Catalog Support
Hive 3.x Metastore
Hive 2.3 Execution
Hadoop 3
Support
JDK 11
Scala 2.12
Extensibility and Ecosystem
Catalog plugin API
Users can register customized catalogs and use Spark to
access/manipulate table metadata directly.
JDBC data source v2 is coming in Spark 3.1!
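A hedged sketch of the plugin mechanism (the catalog name and implementation class below are hypothetical placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Register a custom catalog under the name "my_catalog"; the class must
    # implement the DSv2 catalog interfaces.
    .config("spark.sql.catalog.my_catalog", "com.example.MyCatalogImpl")
    .getOrCreate()
)

# Tables in the custom catalog are addressed with a three-part name.
spark.sql("SELECT * FROM my_catalog.db.events").show()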
To developers: When to use Data Source V2?
▪ Pick V2 if you want to provide catalog functionality, which V1 doesn't have.
▪ Pick V2 if you want to support both batch and streaming, as V1 has separate APIs for batch and streaming, which makes it hard to reuse code.
▪ Pick V2 if scan performance matters, as V2 lets you report data partitioning to skip shuffles and implement a vectorized reader for better performance.
Note: the Data Source V2 API is not as stable as V1!
Improve the plug-in interface and extend the deployment environments
Data Source V2 API
+ Catalog Support
Hive 3.x Metastore
Hive 2.3 Execution
Hadoop 3
Support
JDK 11
Support
Extensibility and Ecosystem
Spark 3.0 Builds
• Only builds with Scala 2.12
• Deprecates Python 2 (already EOL)
• Can build with various Hadoop/Hive versions
– Hadoop 2.7 + Hive 1.2
– Hadoop 2.7 + Hive 2.3 (supports Java 11)  [Default]
– Hadoop 3.2 + Hive 2.3 (supports Java 11)
• Supports the following Hive metastore versions:
– "0.12", "0.13", "0.14", "1.0", "1.1", "1.2", "2.0", "2.1", "2.2", "2.3", "3.0", "3.1"
32 New Built-in Functions
▪ map
Documentation
• Web UI
• SQL reference
• Migration guide
• Semantic versioning guidelines
Adaptive Query
Execution
Dynamic Partition
Pruning
Query Compilation
Speedup
Join Hints
Performance
Richer APIs
Accelerator-aware
Scheduler
Built-in
Functions
pandas UDF
Enhancements
DELETE/UPDATE/
MERGE in Catalyst
Reserved
Keywords
Proleptic Gregorian
Calendar
ANSI Store
Assignment
Overflow
Checking
SQL
Compatibility
Built-in Data
Sources
Parquet/ORC Nested
Column Pruning
Parquet: Nested Column
Filter Pushdown
CSV Filter
Pushdown
New Binary
Data Source
Data Source V2 API +
Catalog Support
JDK 11 Support
Hadoop 3
Support
Hive 3.x Metastore
Hive 2.3 Execution
Extensibility and Ecosystem
Structured
Streaming UI
DDL/DML
Enhancements
Observable
Metrics
Event Log
Rollover
Monitoring and Debuggability
Try Databricks
Runtime 7.0
https://databricks.com/try-databricks
https://community.cloud.databricks.com/
Managing the ML Lifecycle
MLflow joins the Linux Foundation
Autologging
Autologging on Databricks (coming soon)
Recreate exact configuration of an experiment run
Model Registry - Model Signature (1.9)
Model Registry - Custom tags (coming soon)
Tags & search in the Model Registry to track custom metadata
APIs to integrate automatic CI/CD
Model Deployment - APIs (coming soon)
Model Serving on Databricks (coming soon)
What is Koalas?
Implementation of Pandas APIs over Spark
▪ Easily port existing data science code
Launched at Spark+AI Summit 2019
Now up to 850,000 downloads per month (1/5th of PySpark!)
import databricks.koalas as ks
df = ks.read_csv(file)
df['x'] = df.y * df.z
df.describe()
df.plot.line(...)
Announcing Koalas 1.0!
Close to 80% API coverage
Faster performance with Spark 3.0 APIs
More support for missing values, NA,
and in-place updates
Faster distributed index type
pip install koalas
to get started!
[Benchmark chart, time in seconds: 20% faster and 26% faster.]
The Lakehouse paradigm
Data Warehouses
were purpose-built
for BI and reporting,
however…
▪ No support for video, audio, text
▪ No support for data science, ML
▪ Limited support for streaming
▪ Closed & proprietary formats
Therefore, most data is stored in
data lakes & blob stores
[Diagram: External Data and Operational Data flow through ETL into Data Warehouses, which feed BI and Reports.]
Data Lakes
could handle all your data
for data science and ML,
however…
▪ Poor BI support
▪ Complex to set up and
maintain
▪ Poor performance
▪ Unreliable data swamps
[Diagram: a Data Lake holding structured, semi-structured and unstructured data, surrounded by Data Prep and Validation, ETL into Data Warehouses and a Real-Time Database, and consumers for BI, Reports, Data Science and Machine Learning.]
Coexistence is not a desirable strategy
[Diagram: the data lake architecture and the data warehouse architecture from the previous two slides, running side by side.]
Data Lakes are Key for Analytics
Data Lake
Data Science and ML
• Recommendation Engines
• Risk, Fraud, & Intrusion Detection
• Customer Analytics
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Data Lakes are Key for Analytics
...but unreliable, low-quality data and slow performance hold back those same data science and ML use cases in practice.
> 65% of big data projects fail, per Gartner
The Lakehouse paradigm
[Diagram: Data Warehouse and Data Lake, alongside the workloads they serve: Streaming, Analytics, BI, Data Science and Machine Learning over structured, semi-structured and unstructured data.]
Data Lake for all your data
One platform for every use case
Structured transactional layer
[Diagram: Streaming Analytics, BI, Data Science and Machine Learning on one platform over structured, semi-structured and unstructured data, with DELTA ENGINE.]
A new standard for building data lakes
An opinionated approach to Data Lakes
■ Brings reliability, quality and
performance to Data Lakes
■ Provides ACID transactions, scalable
metadata handling, and unifies streaming
and batch data processing.
■ Fully compatible with Spark APIs.
■ Based on open source and open format
(Parquet)
0.7.0 released!
Support for SQL DELETE, UPDATE and MERGE - with Spark 3.0
Lots of other improvements...
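A hedged PySpark sketch of Delta Lake 0.7.0 on Spark 3.0 (table and column names are hypothetical); the two configs enable the SQL DELETE/UPDATE/MERGE support mentioned above:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Upsert new rows into a Delta table with plain SQL.
spark.sql("""
    MERGE INTO events t
    USING updates s
    ON t.eventId = s.eventId
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")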
1. Hard to append data
2. Modification of existing data difficult
3. Jobs failing mid way
4. Complex real-time operations
5. Costly to keep historical data versions
6. Difficult to handle large metadata
7. “Too many files” problems
8. Poor performance
9. Data quality issues
Data Lake Challenges
How Delta Lake addresses these challenges:
ACID transactions
Spark under the hood
Indexing & compaction
Schema validation & expectations
Building a Curated Data Lake
Expectations enable creating a curated Data Lake:
BRONZE - raw ingestion and history
SILVER - filtered, cleaned, augmented
GOLD - business-level aggregates
Getting started with Delta is very easy!
Create tables using SQL syntax:
CREATE TABLE ... USING delta
Migrate existing tables:
CREATE TABLE ... USING parquet
CONVERT TO DELTA events
Lakehouse
Data Lake for all your data
One platform for every use case
Structured transactional layer
[Diagram: Streaming Analytics, BI, Data Science and Machine Learning over structured, semi-structured and unstructured data.]
https://databricks.com/sparkaisummit/north-america-2020/agenda
Thank you!
