July 16, 2020
Spark + AI Summit 2020
(recap)
68,000+ registrations from 100+ countries!
https://databricks.com/sparkaisummit/north-america-2020/agenda
Agenda
What's new in Spark 3.0
Spark has just turned 10 years old; let's look at some of the enhancements in the latest release
Managing the ML lifecycle
MLFlow updates
The Lakehouse paradigm
One platform for data warehousing and data
science
Who are we?
Patricia Ferreiro Alonso
patricia.ferreiro@databricks.com
Dr. Piotr Majer
piotr.majer@databricks.com
Guido Oswald
guido@databricks.com / @guidooswald
What’s new in Spark 3.0
Major Lessons
1. Focus on ease of use in both exploration and production
2. APIs to enable software best practices: composition, testability, modularity
3400+ Resolved
JIRAs
in Spark 3.0 rc2
Adaptive Query
Execution
Dynamic Partition
Pruning
Query Compilation
Speedup
Join Hints
Performance
Richer APIs
Accelerator-aware
Scheduler
Built-in
Functions
pandas UDF
Enhancements
DELETE/UPDATE/
MERGE in Catalyst
Reserved
Keywords
Proleptic Gregorian
Calendar
ANSI Store
Assignment
Overflow
Checking
SQL
Compatibility
Built-in Data
Sources
Parquet/ORC Nested
Column Pruning
Parquet: Nested Column
Filter Pushdown
CSV Filter
Pushdown
New Binary
Data Source
Data Source V2 API +
Catalog Support
JDK 11 Support
Hadoop 3
Support
Hive 3.x Metastore
Hive 2.3 Execution
Extensibility and Ecosystem
Structured
Streaming UI
DDL/DML
Enhancements
Observable
Metrics
Event Log
Rollover
Monitoring and Debuggability
Performance
Achieve high performance for interactive, batch, streaming and ML workloads
Adaptive Query
Execution
Dynamic Partition
Pruning
Query Compilation
Speedup
Join Hints
Spark Catalyst Optimizer
Spark 1.x, Rule
Spark 2.x, Rule + Cost
Query Optimization in Spark 2.x
▪ Missing statistics - statistics collection is expensive
▪ Out-of-date statistics - compute and storage are separated
▪ Suboptimal heuristics - local only
▪ Misestimated costs - complex environments, user-defined functions
Spark Catalyst Optimizer
Spark 1.x, Rule
Spark 2.x, Rule + Cost
Spark 3.0, Rule + Cost + Runtime
adaptive planning
Based on statistics of the finished plan nodes, re-optimize the execution plan of the remaining queries
Adaptive Query Execution
Blog post:
https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
Based on statistics of the finished plan nodes, re-optimize
the execution plan of the remaining queries
• Convert Sort Merge Join to Broadcast Hash Join
• Shrink the number of reducers
• Handle skew join
Adaptive Query Execution
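As a hedged illustration of the feature above (not from the slides): in Spark 3.0, AQE is controlled by the spark.sql.adaptive.* configs and is disabled by default. A minimal PySpark sketch, assuming an existing SparkSession named spark:

# Turn on Adaptive Query Execution (off by default in Spark 3.0).
spark.conf.set("spark.sql.adaptive.enabled", "true")
# With AQE on, shuffle statistics collected at stage boundaries are used to
# re-optimize the remaining plan: join strategy, partition coalescing, skew handling.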
Convert Sort Merge Join to Broadcast Hash Join
[Diagram, before execution: the two join inputs are estimated at 100 MB and 300 MB, so the planner chooses a Sort Merge Join (Scan -> Shuffle -> Sort on each side, stages 1 and 2).]
[Diagram, after executing stages 1 and 2: the actual sizes are 86 MB and 8 MB, so AQE re-optimizes the remaining plan into a Broadcast Hash Join and broadcasts the small side.]
Dynamically Coalesce Shuffle Partitions
▪ Set the initial partition number high to accommodate the largest data size of the entire query execution
▪ Automatically coalesce partitions if needed after each query stage
[Diagram: a stage of Scan -> Filter -> Shuffle (50 partitions) -> Sort; at runtime AQE detects 50 small partitions and coalesces them into 5.]
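A minimal sketch of the related Spark 3.0 configs (assuming AQE is enabled and an existing SparkSession named spark; the values shown are illustrative, not necessarily the defaults):

# Coalesce small post-shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Start with a high shuffle partition number to cover the largest stage...
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")
# ...and let AQE merge partitions toward this advisory target size.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")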
Data Skew in Sort Merge Join
[Diagram: partition A0 of table A is far larger than partitions A1-A3, so the task that joins A0 with B0 dominates the runtime of the Sort Merge Join.]
Dynamically Optimize Skew Joins
[Diagram: AQE splits the skewed partition A0 into A0-S0, A0-S1 and A0-S2 and duplicates B0 for each split, balancing the join work across tasks.]
Adaptive Query Execution
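A minimal sketch of the Spark 3.0 skew-join configs (assuming AQE is enabled and an existing SparkSession named spark; the values shown are illustrative):

# Split skewed shuffle partitions and replicate the matching partitions on the other side.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# A partition is treated as skewed when it is both skewedPartitionFactor times larger
# than the median partition size and larger than this byte threshold.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "10")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")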
Performance
Achieve high performance for interactive, batch, streaming and ML
workloads
Adaptive Query
Execution
Dynamic Partition
Pruning
Join Hints
Query Compilation
Speedup
Dynamic Partition Pruning
• Avoid scanning partitions based on the query results of other query fragments.
• Important for star-schema queries.
• Significant speedup in TPC-DS.
Example: t1 is a large fact table with many partitions, t2 is a dimension table with a filter.

SELECT t1.id, t2.pKey
FROM t1
JOIN t2
  ON t1.pKey = t2.pKey
 AND t2.id < 2

[Query-plan diagrams: Project over a Join on t1.pKey = t2.pKey, with a plain Scan of t1 on one side and Filter (t2.id < 2) + Scan of t2 on the other.]

• Without pruning, all partitions of t1 are scanned, while the filter t2.id < 2 is pushed down only on the t2 side.
• A static filter pushdown would rewrite the t1 scan as t1.pKey IN (SELECT t2.pKey FROM t2 WHERE t2.id < 2), scanning only the required partitions of t1, but at the cost of evaluating the subquery on t2 separately.
• With Dynamic Partition Pruning, the broadcast hash join keys of t2 (the DPPFilterResult) are reused to scan only the required partitions of t1, so each table is read just once.

Result: 90+% less file scan, 33x faster.
Dynamic Partition Pruning
60 / 102 TPC-DS queries: a speedup between 2x and 18x
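A hedged PySpark sketch of the same query shape (assuming an existing SparkSession named spark and that t1 is partitioned by pKey); DPP is on by default in 3.0:

spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Star-schema shaped query: the filter on the dimension table t2 becomes a
# partition filter on the fact table t1 at runtime.
spark.sql("""
    SELECT t1.id, t2.pKey
    FROM t1
    JOIN t2 ON t1.pKey = t2.pKey AND t2.id < 2
""").explain()   # look for a dynamic pruning expression in the partition filters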
Performance
Achieve high performance for interactive, batch, streaming and ML workloads
Adaptive Query
Execution
Dynamic Partition
Pruning
Join Hints
Query Compilation
Speedup
Optimizer Hints
▪ Join hints influence the optimizer to choose a particular join strategy:
  ▪ Broadcast hash join
  ▪ Sort-merge join (NEW)
  ▪ Shuffle hash join (NEW)
  ▪ Shuffle nested loop join (NEW)
▪ Should be used with extreme caution.
▪ Difficult to manage over time.
▪ Broadcast Hash Join
SELECT /*+ BROADCAST(a) */ id FROM a JOIN b ON a.key = b.key
▪ Sort-Merge Join
SELECT /*+ MERGE(a, b) */ id FROM a JOIN b ON a.key = b.key
▪ Shuffle Hash Join
SELECT /*+ SHUFFLE_HASH(a, b) */ id FROM a JOIN b ON a.key = b.key
▪ Shuffle Nested Loop Join
SELECT /*+ SHUFFLE_REPLICATE_NL(a, b) */ id FROM a JOIN b
How to Use Join Hints?
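The same hints can also be given through the DataFrame API. A minimal sketch (a and b are hypothetical DataFrames sharing a key column):

from pyspark.sql.functions import broadcast

a.join(broadcast(b), "key")                    # broadcast hash join
a.join(b.hint("merge"), "key")                 # sort-merge join
a.join(b.hint("shuffle_hash"), "key")          # shuffle hash join
a.join(b.hint("shuffle_replicate_nl"), "key")  # shuffle-and-replicate nested loop join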
3.0: SQL Compatibility
ANSI Reserved
Keywords
ANSI Gregorian
Calendar
ANSI Store
Assignment
ANSI Overflow
Checking
ANSI SQL language dialect and broad support
3.0: Python Usability
Python type hints for pandas UDFs (replacing the old PandasUDFType-based API)
3.0: Other Features
Structured Streaming UI
Observable stream metrics
SQL reference guide
Data Source v2 API
GPU scheduling support
...
Enable new use cases and simplify the Spark application development
Accelerator-aware
Scheduler
Built-in
Functions
pandas UDF
enhancements
DELETE/UPDATE/
MERGE in Catalyst
Richer APIs
Accelerator-aware Scheduling
▪ Widely used for accelerating specialized workloads, e.g., deep learning and signal processing.
▪ Supported on Standalone, YARN and K8S.
▪ Supports GPUs now; FPGAs, TPUs, etc. in the future.
▪ Required resources are specified via configs (see the sketch below).
▪ Application level for now; job/stage/task level will be supported in the future.
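A minimal PySpark sketch of the Spark 3.0 resource configs (the discovery-script path is a placeholder; a sample getGpusResources.sh ships with the Spark distribution):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # One GPU per executor, a quarter of a GPU per task (so four tasks share each GPU).
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    # Script that reports the GPU addresses available to an executor.
    .config("spark.executor.resource.gpu.discoveryScript", "/path/to/getGpusResources.sh")
    .getOrCreate()
)

# Inside a task, the assigned GPU addresses are exposed via TaskContext:
# from pyspark import TaskContext
# gpus = TaskContext.get().resources()["gpu"].addresses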
Enable new use cases and simplify the Spark application development
Richer APIs
Accelerator-aware
Scheduler
Built-in
Functions
pandas UDF
enhancements
DELETE/UPDATE
/MERGE in
Catalyst
32 New Built-in Functions
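A few of the 32 new SQL functions, as a hedged sketch (table and column names are hypothetical; assuming an existing SparkSession named spark):

spark.sql("SELECT make_date(2020, 7, 16)").show()                # build a date from year/month/day
spark.sql("SELECT count_if(val > 5) FROM tab1").show()           # count rows matching a predicate
spark.sql("SELECT max_by(name, salary) FROM employees").show()   # name at the maximum salary
spark.sql("SELECT bit_count(255)").show()                        # number of set bits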
Enable new use cases and simplify the Spark application development
Richer APIs
Accelerator-aware
Scheduler
Built-in
Functions
pandas UDF
enhancements
DELETE/UPDATE/
MERGE in Catalyst
The evolution of Python UDF support in Spark:
V 0.7 (2013): Python lambda functions for RDDs
V 1.2 (2014): Python UDF for SQL
V 2.0 (2016): Session-specific Python UDF
V 2.1 (2017): Java UDF in Python API
V 2.3/2.4 (2018): New Pandas UDF
V 3.0 (2019/2020): Python Type Hints
https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html
Scalar Pandas UDF
[pandas.Series to pandas.Series]
SPARK 2.3 SPARK 3.0
Python Type Hints
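The code on this slide was shown as an image; a hedged reconstruction of the two styles (df is an assumed DataFrame with a long column "v"):

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Spark 2.3 style: the UDF type is passed explicitly.
# from pyspark.sql.functions import PandasUDFType
# @pandas_udf("long", PandasUDFType.SCALAR)
# def plus_one(v):
#     return v + 1

# Spark 3.0 style: the UDF type is inferred from Python type hints.
@pandas_udf("long")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

df.select(plus_one("v")).show()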
Grouped Map Pandas Function API
[pandas.DataFrame to pandas.DataFrame]
SPARK 2.3 SPARK 3.0
Python Type Hints
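A hedged reconstruction of the before/after shown on this slide (df is an assumed DataFrame with columns id: long and v: double):

import pandas as pd

# Spark 2.3 style: a GROUPED_MAP pandas UDF applied with groupby(...).apply(...).
# @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
# def subtract_mean(pdf):
#     return pdf.assign(v=pdf.v - pdf.v.mean())
# df.groupby("id").apply(subtract_mean)

# Spark 3.0 style: a plain Python function plus the applyInPandas Function API.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()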
Grouped Aggregate Pandas UDF
[pandas.Series to Scalar]
SPARK 2.4 SPARK 3.0
Python Type Hints
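A hedged reconstruction (same assumed df with id and v columns):

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Spark 2.4 style:
# @pandas_udf("double", PandasUDFType.GROUPED_AGG)
# def mean_udf(v):
#     return v.mean()

# Spark 3.0 style: pandas.Series in, scalar out, inferred from the type hints.
@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    return v.mean()

df.groupby("id").agg(mean_udf(df["v"])).show()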
New Pandas UDF Types
Map Pandas UDF
Cogrouped Map Pandas UDF
New Pandas Function APIs
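A hedged sketch of the two new Function APIs (df, df1, df2 and their schemas are assumptions for illustration):

from typing import Iterator
import pandas as pd

# Map: an iterator of pandas DataFrames in, an iterator of pandas DataFrames out.
def keep_id_one(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in batches:
        yield pdf[pdf.id == 1]

df.mapInPandas(keep_id_one, schema=df.schema)

# Cogrouped map: two DataFrames grouped by the same key, merged one group at a time.
def asof_join(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    return pd.merge_asof(left, right, on="time", by="id")

df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
    asof_join, schema="time long, id int, v1 double, v2 string")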
Make monitoring and debugging Spark applications more comprehensive and stable
Structured
Streaming UI
DDL/DML
Enhancements
Observable
Metrics
Event Log
Rollover
Monitoring and Debuggability
New Command EXPLAIN FORMATTED
*(1) Project [key#5, val#6]
+- *(1) Filter (isnotnull(key#5) AND (key#5 = Subquery scalar-subquery#15, [id=#113]))
   :  +- Subquery scalar-subquery#15, [id=#113]
   :     +- *(2) HashAggregate(keys=[], functions=[max(key#21)])
   :        +- Exchange SinglePartition, true, [id=#109]
   :           +- *(1) HashAggregate(keys=[], functions=[partial_max(key#21)])
   :              +- *(1) Project [key#21]
   :                 +- *(1) Filter (isnotnull(val#22) AND (val#22 > 5))
   :                    +- *(1) ColumnarToRow
   :                       +- FileScan parquet default.tab2[key#21,val#22] Batched: true, DataFilters: [isnotnull(val#22), (val#22 > 5)],
Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/tab2], PartitionFilters: [], PushedFilters:
[IsNotNull(val), GreaterThan(val,5)], ReadSchema: struct<key:int,val:int>
   +- *(1) ColumnarToRow
      +- FileScan parquet default.tab1[key#5,val#6] Batched: true, DataFilters: [isnotnull(key#5)], Format: Parquet, Location:
InMemoryFileIndex[file:/user/hive/warehouse/tab1], PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema:
struct<key:int,val:int>
EXPLAIN FORMATTED
SELECT *
FROM tab1
WHERE key = (SELECT max(key)
FROM tab2
WHERE val > 5)
* Project (4)
+- * Filter (3)
   +- * ColumnarToRow (2)
      +- Scan parquet default.tab1 (1)
(1) Scan parquet default.tab1 
Output [2]: [key#5, val#6]
Batched: true
Location: InMemoryFileIndex [file:/user/hive/warehouse/tab1]
PushedFilters: [IsNotNull(key)]
ReadSchema: struct<key:int,val:int>
     
(2) ColumnarToRow [codegen id : 1]
Input [2]: [key#5, val#6]
(3) Filter [codegen id : 1]
Input [2]: [key#5, val#6]
Condition : (isnotnull(key#5) AND (key#5 = Subquery scalar-subquery#27, [id=#164]))
     
(4) Project [codegen id : 1]
Output [2]: [key#5, val#6]
Input [2]: [key#5, val#6]
EXPLAIN FORMATTED
SELECT *
FROM tab1
WHERE key = (SELECT max(key)
FROM tab2
WHERE val > 5)
(5) Scan parquet default.tab2 
Output [2]: [key#21, val#22]
Batched: true
Location: InMemoryFileIndex [file:/user/hive/warehouse/tab2]
PushedFilters: [IsNotNull(val), GreaterThan(val,5)]
ReadSchema: struct<key:int,val:int>
(6) ColumnarToRow [codegen id : 1]
Input [2]: [key#21, val#22]
(7) Filter [codegen id : 1]
Input [2]: [key#21, val#22]
Condition : (isnotnull(val#22) AND (val#22 > 5))
===== Subqueries =====
Subquery:1 Hosting operator id = 3 Hosting Expression = Subquery scalar-subquery#27, [id=#164]
* HashAggregate (11)
+- Exchange (10)
   +- * HashAggregate (9)
      +- * Project (8)
         +- * Filter (7)
            +- * ColumnarToRow (6)
               +- Scan parquet default.tab2 (5)
(8) Project [codegen id : 1]
Output [1]: [key#21]
Input [2]: [key#21, val#22]
(9) HashAggregate [codegen id : 1]
Input [1]: [key#21]
Keys: []
Functions [1]: [partial_max(key#21)]
Aggregate Attributes [1]: [max#35]
Results [1]: [max#36]
(10) Exchange 
Input [1]: [max#36]
Arguments: SinglePartition, true, [id=#160]
(11) HashAggregate [codegen id : 2]
Input [1]: [max#36]
Keys: []
Functions [1]: [max(key#21)]
Aggregate Attributes [1]: [max(key#21)#33]
Results [1]: [max(key#21)#33 AS max(key)#34]
DDL/DML
Enhancements
Make monitoring and debugging Spark applications more comprehensive and stable
Monitoring and Debuggability
Structured
Streaming
UI
Observable
Metrics
Event Log
Rollover
A flexible way to monitor data quality.
Observable Metrics
Reduce the time and complexity of enabling applications that were written for other
relational database products to run in Spark SQL.
Reserved
Keywords in Parser
Proleptic Gregorian
Calendar
ANSI Store
Assignment
Overflow
Checking
SQL Compatibility
A safer way to do table insertion and avoid bad data.
ANSI store assignment + overflow check
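A minimal sketch of the related Spark 3.0 configs (assuming an existing SparkSession named spark):

# ANSI store assignment policy for table insertion (runtime checks instead of silent coercion).
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
# ANSI mode: runtime overflow checking, ANSI reserved keywords in the parser, etc.
spark.conf.set("spark.sql.ansi.enabled", "true")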
Enhance the performance and functionalities of the built-in data sources
Parquet/ORC Nested
Column Pruning
Parquet: Nested Column
Filter Pushdown
New Binary
Data Source
CSV Filter
Pushdown
Built-in Data Sources
▪ Skip reading useless data blocks when only a few inner fields are
selected.
Better performance for nested fields
▪ Skip reading useless data blocks when there are predicates with
inner fields.
Better performance for nested fields
Improve the plug-in interface and extend the deployment environments
Data Source V2 API
+ Catalog Support
Hive 3.x Metastore
Hive 2.3 Execution
Hadoop 3
Support
JDK 11
Scala 2.12
Extensibility and Ecosystem
Catalog plugin API
Users can register customized catalogs and use Spark to
access/manipulate table metadata directly.
JDBC data source v2 is coming in Spark 3.1!
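A hedged sketch of the plugin mechanism (the catalog name and implementation class below are hypothetical placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Register a custom catalog under the name "my_catalog"; the class must
    # implement the DSv2 catalog interfaces.
    .config("spark.sql.catalog.my_catalog", "com.example.MyCatalogImpl")
    .getOrCreate()
)

# Tables in the custom catalog are addressed with a three-part name.
spark.sql("SELECT * FROM my_catalog.db.events").show()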
To developers: When to use Data Source V2?
▪ Pick V2 if you want to provide catalog functionality, which V1 doesn't have.
▪ Pick V2 if you want to support both batch and streaming, as V1 has separate APIs for batch and streaming, which makes it hard to reuse code.
▪ Pick V2 if scan performance matters, as V2 lets you report data partitioning to skip shuffles and implement a vectorized reader for better performance.
Note: the Data Source V2 API is not as stable as V1!
Improve the plug-in interface and extend the deployment environments
Data Source V2 API
+ Catalog Support
Hive 3.x Metastore
Hive 2.3 Execution
Hadoop 3
Support
JDK 11
Support
Extensibility and Ecosystem
Spark 3.0 Builds
• Only builds with Scala 2.12
• Deprecates Python 2 (already EOL)
• Can build with various Hadoop/Hive versions
– Hadoop 2.7 + Hive 1.2
– Hadoop 2.7 + Hive 2.3 (supports Java 11)  [Default]
– Hadoop 3.2 + Hive 2.3 (supports Java 11)
• Supports the following Hive metastore versions:
– "0.12", "0.13", "0.14", "1.0", "1.1", "1.2", "2.0", "2.1", "2.2", "2.3", "3.0", "3.1"
32 New Built-in Functions
▪ map
Documentation
• Web UI
• SQL reference
• Migration guide
• Semantic versioning guidelines
Adaptive Query
Execution
Dynamic Partition
Pruning
Query Compilation
Speedup
Join Hints
Performance
Richer APIs
Accelerator-aware
Scheduler
Built-in
Functions
pandas UDF
Enhancements
DELETE/UPDATE/
MERGE in Catalyst
Reserved
Keywords
Proleptic Gregorian
Calendar
ANSI Store
Assignment
Overflow
Checking
SQL
Compatibility
Built-in Data
Sources
Parquet/ORC Nested
Column Pruning
Parquet: Nested Column
Filter Pushdown
CSV Filter
Pushdown
New Binary
Data Source
Data Source V2 API +
Catalog Support
JDK 11 Support
Hadoop 3
Support
Hive 3.x Metastore
Hive 2.3 Execution
Extensibility and Ecosystem
Structured
Streaming UI
DDL/DML
Enhancements
Observable
Metrics
Event Log
Rollover
Monitoring and Debuggability
Try Databricks
Runtime 7.0
https://databricks.com/try-databricks
https://community.cloud.databricks.com/
Managing the ML Lifecycle
MLflow joins the Linux Foundation
Autologging
Autologging on Databricks (coming soon)
Recreate exact configuration of an experiment run
Model Registry - Model Signature (1.9)
Model Registry - Custom tags (coming soon)
Tags & search in the Model Registry to track custom metadata
APIs to integrate automatic CI/CD
Model Deployment - APIs (coming soon)
Model Serving on Databricks (coming soon)
What is Koalas?
Implementation of Pandas APIs over Spark
▪ Easily port existing data science code
Launched at Spark+AI Summit 2019
Now up to 850,000 downloads per month (1/5th of PySpark!)
import databricks.koalas as ks
df = ks.read_csv(file)
df['x'] = df.y * df.z
df.describe()
df.plot.line(...)
Announcing Koalas 1.0!
Close to 80% API coverage
Faster performance with Spark 3.0 APIs
More support for missing values, NA,
and in-place updates
Faster distributed index type
pip install koalas
to get started!
[Benchmark chart, time in seconds: 20% faster and 26% faster.]
The Lakehouse paradigm
Data Warehouses
were purpose-built
for BI and reporting,
however…
▪ No support for video, audio, text
▪ No support for data science, ML
▪ Limited support for streaming
▪ Closed & proprietary formats
Therefore, most data is stored in
data lakes & blob stores
[Diagram: External Data and Operational Data flow through ETL into Data Warehouses, which feed BI and Reports.]
Data Lakes
could handle all your data
for data science and ML,
however…
▪ Poor BI support
▪ Complex to set up and
maintain
▪ Poor performance
▪ Unreliable data swamps
[Diagram: a Data Lake holding structured, semi-structured and unstructured data, surrounded by Data Prep and Validation, ETL into Data Warehouses and a Real-Time Database, and consumers for BI, Reports, Data Science and Machine Learning.]
Coexistence is not a desirable strategy
[Diagram: the data lake architecture and the data warehouse architecture from the previous two slides, running side by side.]
Data Lakes are Key for Analytics
Data Lake
Data Science and ML
• Recommendation Engines
• Risk, Fraud, & Intrusion Detection
• Customer Analytics
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Data Lakes are Key for Analytics
...but unreliable, low-quality data and slow performance hold back those same data science and ML use cases in practice.
> 65% of big data projects fail, per Gartner
The Lakehouse paradigm
[Diagram: Data Warehouse and Data Lake, alongside the workloads they serve: Streaming, Analytics, BI, Data Science and Machine Learning over structured, semi-structured and unstructured data.]
Data Lake for all your data
One platform for every use case
Structured transactional layer
[Diagram: Streaming Analytics, BI, Data Science and Machine Learning on one platform over structured, semi-structured and unstructured data, with DELTA ENGINE.]
A new standard for building data lakes
An opinionated approach to Data Lakes
■ Brings reliability, quality and
performance to Data Lakes
■ Provides ACID transactions, scalable
metadata handling, and unifies streaming
and batch data processing.
■ Fully compatible with Spark APIs.
■ Based on open source and open format
(Parquet)
0.7.0 released!
Support for SQL DELETE, UPDATE and MERGE - with Spark 3.0
Lots of other improvements...
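A hedged PySpark sketch of Delta Lake 0.7.0 on Spark 3.0 (table and column names are hypothetical); the two configs enable the SQL DELETE/UPDATE/MERGE support mentioned above:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Upsert new rows into a Delta table with plain SQL.
spark.sql("""
    MERGE INTO events t
    USING updates s
    ON t.eventId = s.eventId
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")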
1. Hard to append data
2. Modification of existing data difficult
3. Jobs failing mid way
4. Complex real-time operations
5. Costly to keep historical data versions
6. Difficult to handle large metadata
7. “Too many files” problems
8. Poor performance
9. Data quality issues
Data Lake Challenges
How Delta Lake addresses these challenges:
ACID transactions
Spark under the hood
Indexing & compaction
Schema validation & expectations
Building a Curated Data Lake
Expectations enable creating a curated Data Lake:
BRONZE - raw ingestion and history
SILVER - filtered, cleaned, augmented
GOLD - business-level aggregates
Getting started with Delta is very easy!
Create tables using SQL syntax:
CREATE TABLE ... USING delta
Migrate existing tables:
CREATE TABLE ... USING parquet
CONVERT TO DELTA events
Lakehouse
Data Lake for all your data
One platform for every use case
Structured transactional layer
[Diagram: Streaming Analytics, BI, Data Science and Machine Learning over structured, semi-structured and unstructured data.]
https://databricks.com/sparkaisummit/north-america-2020/agenda
Thank you!
