Hyperspace for Delta Lake
Rahul Potharaju, Terry Kim, Eunjin Song
Microsoft
@RahulPotharaju
Who?
Rahul Potharaju
Principal Software Engineering Manager @Microsoft
Part of the Spark team at Microsoft
Azure Synapse Analytics
OSS: Hyperspace, .NET for Apache Spark
You can also find me publishing in VLDB, NSDI etc.
Terry Kim
Principal Software Engineer @Microsoft
Part of the Spark team at Microsoft
Azure Synapse Analytics
OSS: Hyperspace, .NET for Apache Spark, Apache Spark
Eunjin Song
Senior Software Engineer @Microsoft
Part of the Spark team at Microsoft
Azure Synapse Analytics
OSS: Hyperspace
We work on everything Apache Spark™
Spark Runtime, Spark Service, HW Acc in Synapse, Debugging & Diagnostics
Offer Apache Spark™-as-a-service to Microsoft customers
Runtimes for Synapse Spark, HDInsight Spark, Spark on Cosmos
Contribute back to Apache Spark™
Spark SQL, Datasource v2, #47/1600 Spark contributor
We open source our work!
Hyperspace, .NET for Spark
Agenda
▪ Rahul Potharaju
▪ Background, Concepts, Conclusion
▪ Terry Kim
▪ Demo, Performance Deep-dive
What is Hyperspace?
Hyperspace in a Nutshell
Simple Usage API
// Index Maintenance
createIndex(df: DataFrame, indexCfg: IndexConfig): Unit
deleteIndex(indexName: String): Unit
restoreIndex(indexName: String): Unit
vacuumIndex(indexName: String): Unit
refreshIndex(indexName: String): Unit
cancel(indexName: String): Unit
Language Choices
Scala, Python, .NET
Highlights
New extensible indexing subsystem for Apache Spark
Same technology that powers the indexing engine inside Azure Synapse Analytics
Works out-of-box with open-source Apache Spark
Accelerated performance on key workloads
Hyperspace Use Cases
High-Concurrency Interactive Analytics and Data Export
Indexing Privacy Attributes for GDPR Compliance
Time-series Analytics
Framework for Derived Dataset Maintenance
Needle-in-a-haystack Queries
https://aka.ms/HyperspaceIntroTalk https://aka.ms/Hyperspace
16 Project Contributors
213 Pull Requests Merged
263 GitHub Stars
180 Issues Reported
Up to 10x query performance improvement
Open-Sourced @Spark+AI Summit 2020
https://aka.ms/Hyperspace-Blog https://github.com/microsoft/hyperspace
Top User Request from Spark+AI Summit 2020 & Microsoft Customers
“Will Hyperspace work for Delta Lake?”
Yes – giving back to the community like this makes me believe Microsoft is a very different company than twenty years ago. Definitely, a culture I would enjoy! Thank you!!!
Great presentation. Just finished adding our own secondary indexing schema for training selection. Lots of common threads here.
Good to see Microsoft contributing to the community with such awesome work. Kudos to the team.
Thank you, quite interesting
Great stuff. Thanks!
Awesome work!
Great stuff guys!
This is cool!
Very interesting presentation!
Hyperspace for Delta Lake
Index Maintenance
Hybrid Scan
ACID Data Formats
Index Maintenance
Full Refresh
• Slowest refresh/fastest query
• Rebuilds the entire index
Incremental Refresh
• Slow refresh/fast query
• Builds index on newly added files/partitions
• Drops rows from index immediately
Quick Refresh
• Fastest refresh/fast query
• Captures meta-data for appended files and file/partition predicates for deletes
• Leverages Hybrid Scan at runtime
[Figure: data files (1 to 5, then 6 to 10 appended) and index files before and after .refresh(full), .refresh(incremental), and the quick variant labeled ".refresh(incremental) – Only updates meta-data".]
Model Assumptions:
• Appends and deletes are done at file or partition level, i.e., in-place updates are not supported
• Index was constructed with lineage information
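The three refresh modes can be compared with a toy model. This is a minimal sketch in plain Python, not Hyperspace code: ToyIndex, build_rows and the three refresh functions are hypothetical names, and file ids stand in for real data files. It assumes file-level appends and deletes and an index that keeps lineage (each row remembers its source file), as the model assumptions above state.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    # Lineage: index rows are grouped by the source file they came from.
    rows_by_file: dict = field(default_factory=dict)
    # Meta-data recorded by a quick refresh, to be resolved at query time.
    pending_appended: set = field(default_factory=set)
    pending_deleted: set = field(default_factory=set)


def build_rows(file_id):
    """Stand-in for the expensive work of indexing one source file."""
    return [f"row-from-file-{file_id}"]


def full_refresh(index, current_files):
    """Rebuild the entire index from the current source files."""
    return ToyIndex({f: build_rows(f) for f in current_files})


def incremental_refresh(index, current_files):
    """Index only newly added files and drop rows of deleted files now."""
    kept = {f: rows for f, rows in index.rows_by_file.items() if f in current_files}
    for f in current_files - set(kept):
        kept[f] = build_rows(f)
    return ToyIndex(kept)


def quick_refresh(index, current_files):
    """Leave index data untouched; only record what was appended or deleted."""
    indexed = set(index.rows_by_file)
    return ToyIndex(
        dict(index.rows_by_file),
        pending_appended=current_files - indexed,
        pending_deleted=indexed - current_files,
    )


# Index built on files 1..5; afterwards file 3 is deleted and files 6, 7 are appended.
index = full_refresh(ToyIndex(), {1, 2, 3, 4, 5})
current = {1, 2, 4, 5, 6, 7}
quick = quick_refresh(index, current)
```

After a full or incremental refresh the index covers exactly the current files, whereas the quick variant still holds rows for file 3 and defers the cleanup and the scan of files 6 and 7 to query time.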
Query Processing: Hybrid Scan
[Figure: timeline. Initial dataset created at v1 with files 1 to 6; index I1 created; dataset updated to v2; query on dataset at v2.]
What does Hybrid Scan entail?
Step 1: Compute diff since indexed
[
{4, deleted},
{7, added},
{8, added}
]
Step 2: Rewrite Table Scan as Hybrid Scan
[Figure: query plan over Table A and Table B. The scan of Table B becomes the union (∪) of an Index Scan using I1 as of dataset v1, filtered by σ(file != 4), and a scan of files 7 and 8 followed by a Shuffle.]
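The two steps can be sketched in plain Python. This is an illustration under stated assumptions, not Hyperspace's implementation: compute_diff, hybrid_scan and read_file are hypothetical names, each file holds a single made-up row, and the shuffle of the appended rows shown in the plan is omitted.

```python
def compute_diff(indexed_files, current_files):
    """Step 1: list what changed in the source since the index was built."""
    deleted = [(f, "deleted") for f in sorted(indexed_files - current_files)]
    added = [(f, "added") for f in sorted(current_files - indexed_files)]
    return deleted + added


def hybrid_scan(index_rows, read_file, diff):
    """Step 2: index rows minus deleted files, unioned with a scan of added files."""
    deleted = {f for f, change in diff if change == "deleted"}
    added = [f for f, change in diff if change == "added"]
    # Lineage makes the filter possible: every index row knows its source file.
    from_index = [row for row in index_rows if row["file"] not in deleted]
    from_source = [row for f in added for row in read_file(f)]
    return from_index + from_source


def read_file(file_id):
    """Hypothetical source reader: one row per file."""
    return [{"file": file_id, "key": file_id * 10}]


# Index I1 was built at v1 over files 1..6; at v2 file 4 is gone and 7, 8 are new.
indexed_files = {1, 2, 3, 4, 5, 6}
current_files = {1, 2, 3, 5, 6, 7, 8}
index_rows = [row for f in sorted(indexed_files) for row in read_file(f)]

diff = compute_diff(indexed_files, current_files)
result = hybrid_scan(index_rows, read_file, diff)
full_scan = [row for f in sorted(current_files) for row in read_file(f)]
```

The hybrid result contains the same rows as a full scan of the v2 files, while only files 7 and 8 are read from the source.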
Indexing Support for ACID Data Formats
Snapshot isolation
Readers use a consistent snapshot of the table (no locks). All table updates are atomic.
Distributed planning
File pruning and predicate push-down are distributed to jobs, removing the metastore as a bottleneck.
Version history, rollback and time travel
Table snapshots are kept as history, and tables can roll back if a job produces bad data.
Delta Lake Time Travel
[Figure: Delta Lake commit timeline with v1 (add), v2 (add), v3 (del), v4 (add), v5 (add), v6 (add), v7 (add), and the files in the directory at each version.]
Spark code for reading a Delta Lake table (current default version):
val df = spark
  .read
  .format("delta")
  .load(deltaTablePath)
df.show()
Spark code for Delta Lake time travel (user queries v2):
val df = spark
  .read
  .format("delta")
  .option("versionAsOf", 2)
  .load(deltaTablePath)
df.show()
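Why an older version stays readable can be illustrated by replaying the commit history. The sketch below is a toy model in plain Python, not Delta Lake's log format or reader: snapshot_at is a hypothetical helper, the file names f1 to f6 are made up, and only the add/del sequence mirrors the timeline above.

```python
def snapshot_at(commits, version):
    """Replay commits 1..version and return the files visible in that snapshot."""
    files = set()
    for v, (action, file_name) in enumerate(commits, start=1):
        if v > version:
            break
        if action == "add":
            files.add(file_name)
        elif action == "del":
            # The file drops out of the snapshot; earlier versions still list it.
            files.discard(file_name)
    return files


# Same action sequence as the timeline: add, add, del, add, add, add, add.
commits = [
    ("add", "f1"),
    ("add", "f2"),
    ("del", "f1"),
    ("add", "f3"),
    ("add", "f4"),
    ("add", "f5"),
    ("add", "f6"),
]

v2_files = snapshot_at(commits, 2)
v3_files = snapshot_at(commits, 3)
latest_files = snapshot_at(commits, len(commits))
```

A reader asking for version 2 sees f1 even though the del commit at v3 removed it from every later snapshot.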
Indexing Support for Delta Lake Time Travel
[Figure: Delta Lake commit timeline (v1 add, v2 add, v3 del, v4 add, v5 add, v6 add, v7 add) above the Hyperspace index timeline (create v1, refresh v2, refresh v3).]
User queries snapshot at v3
Hyperspace chooses:
- Index(v1)
User queries snapshot at v4
Hyperspace chooses hybrid scan over:
- Index(v1) + Scan(DeltaLake(v4-v3))
User queries snapshot at v6
Hyperspace compares cost of hybrid scan over:
- Index(v2) + Scan(DeltaLake(v6-v5))
- Index(v3) + Scan(DeltaLake(v6-v7))
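The selection logic can be modeled as picking the index version whose covered snapshot differs least from the queried one. This is a sketch under several assumptions, not Hyperspace's actual planner: plan_for is a hypothetical function, the mapping of Index(v1), Index(v2), Index(v3) to Delta Lake versions 3, 5 and 7 is inferred from the scan ranges on the slide, file names are made up, the cost is simply the number of differing files, and ties go to the index built at the earlier Delta Lake version.

```python
def plan_for(query_version, snapshots, index_built_at):
    """Return (index_name, appended, deleted) with the smallest hypothetical cost."""
    wanted = snapshots[query_version]
    best = None
    # Visit index versions oldest first so that ties keep the earlier one.
    for name, built_at in sorted(index_built_at.items(), key=lambda kv: kv[1]):
        covered = snapshots[built_at]
        appended = wanted - covered  # files to scan from the source
        deleted = covered - wanted   # files to filter out of the index
        cost = len(appended) + len(deleted)
        if best is None or cost < best[0]:
            best = (cost, name, appended, deleted)
    return best[1:]


# Files visible at each Delta Lake version (v3 deletes file "a").
snapshots = {
    1: frozenset("a"),
    2: frozenset("ab"),
    3: frozenset("b"),
    4: frozenset("bc"),
    5: frozenset("bcd"),
    6: frozenset("bcde"),
    7: frozenset("bcdef"),
}
# Assumed mapping of index versions to the Delta Lake version they were built on.
index_built_at = {"Index(v1)": 3, "Index(v2)": 5, "Index(v3)": 7}

plan_v3 = plan_for(3, snapshots, index_built_at)
plan_v4 = plan_for(4, snapshots, index_built_at)
plan_v6 = plan_for(6, snapshots, index_built_at)
```

In this model the v3 query uses Index(v1) with no extra scan and the v4 query uses Index(v1) plus one appended file. For v6 the two candidates from the slide tie at one differing file each, and the tie-break in this sketch happens to pick Index(v2); a real cost model could decide differently.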
Azure Synapse Analytics offers the best Hyperspace indexing experience yet!
• No additional JARs to include
• Fastest access to latest features
• Support for Scala | Python | .NET
• Seamless integration with the UI
• Meta-store integration
• Notebooks for faster iterations
Experience of Using Hyperspace for Delta Lake
Notebook: https://aka.ms/hyperspace-for-delta-lake
Preliminary Performance Evaluation of Hyperspace for Delta Lake
Compute Configuration:
• VM Instance = Azure E8 V3
• Workers/Executors = 7
• Cores per executor = 8
• Executor memory = 47 GB
• Autoscale disabled
• ADLS Gen v2
Experimental Setting
• TPC-DS store_sales: 200 files of 1 GB each (200 GB)
• TPC-DS items: 200 files of 200 KB each (40 MB)
Performance of Hyperspace using TPC-DS Q44
Append more data (files 201 to 250)
Performance of Hyperspace
1. Without refreshing the index
2. After refreshing the index
Performance Implications of Using
Hyperspace for Delta Lake
Open Sourcing Hyperspace v0.1
Conclusion
New extensible indexing subsystem for Apache Spark
Simply add on; no core changes needed
Same technology that powers the indexing engine inside Azure Synapse Analytics
Works out-of-box with open-source Apache Spark
Scala, Python, and .NET support
Accelerated performance on key workloads
Up to 10x query performance improvement
https://github.com/microsoft/hyperspace
Open Sourced
It is not perfect… but fully open to contributions towards being made perfect! ☺
@RahulPotharaju
