Hyperspace for Delta Lake

Rahul Potharaju, Terry Kim, Eunjin Song
Microsoft
@RahulPotharaju

Rahul Potharaju
Principal Software Engineering Manager @Microsoft
Part of the Spark team at Microsoft
Azure Synapse Analytics
OSS: Hyperspace, .NET for Apache Spark
You can also find me publishing in VLDB, NSDI, etc.

Terry Kim
Principal Software Engineer @Microsoft
OSS: Hyperspace, .NET for Apache Spark, Apache Spark

Eunjin Song
Senior Software Engineer @Microsoft
OSS: Hyperspace

We work on everything Apache Spark™
▪ Spark Runtime, Spark Service, HW Acc in Synapse, Debugging & Diagnostics
Offer Apache Spark™-as-a-service to Microsoft customers
▪ Runtimes for Synapse Spark, HDInsight Spark, Spark on Cosmos
Contribute back to Apache Spark™
▪ Spark SQL, Datasource v2, #47/1600 Spark contributor
We open source our work!
▪ Hyperspace, .NET for Spark

Agenda
▪ Rahul Potharaju
  ▪ Background, Concepts, Conclusion
▪ Terry Kim
  ▪ Demo, Performance Deep-dive

Hyperspace in a Nutshell
Simple Usage API
// Index Maintenance
createIndex(df: DataFrame, indexCfg: IndexConfig): Unit
deleteIndex(indexName: String): Unit
restoreIndex(indexName: String): Unit
vacuumIndex(indexName: String): Unit
refreshIndex(indexName: String): Unit
cancel(indexName: String): Unit
Language Choices
▪ Scala
▪ Python
▪ .NET
Highlights
▪ New extensible indexing subsystem for Apache Spark
▪ Same technology that powers the indexing engine inside Azure Synapse Analytics
▪ Works out-of-box with open-source Apache Spark
▪ Accelerated performance on key workloads

Hyperspace Use Cases
High-Concurrency Interactive Analytics and Data Export
Indexing Privacy Attributes for GDPR Compliance
Time-series Analytics
Framework for Derived Dataset Maintenance
Needle-in-a-haystack Queries

Open-sourced at Spark+AI Summit 2020
▪ 16 project contributors
▪ 213 pull requests merged
▪ 263 GitHub stars
▪ 180 issues reported
▪ Up to 10x query performance improvement
https://aka.ms/HyperspaceIntroTalk
https://aka.ms/Hyperspace
https://aka.ms/Hyperspace-Blog
https://github.com/microsoft/hyperspace

Top User Request from Spark+AI Summit 2020 & Microsoft Customers
“Will Hyperspace work for Delta Lake?”
▪ Yes – giving back to the community like this makes me believe Microsoft is a very different company than twenty years ago. Definitely, a culture I would enjoy! Thank you!!!
▪ Great presentation. Just finished adding our own secondary indexing schema for training selection. Lots of common threads here.
▪ Good to see Microsoft contributing to the community with such awesome work. Kudos to the team.
▪ Thank you, quite interesting
▪ Great stuff. Thanks!
▪ Awesome work!
▪ Great stuff guys!
▪ This is cool!
▪ Very interesting presentation!

Hyperspace for Delta Lake
▪ Index Maintenance
▪ Hybrid Scan
▪ ACID Data Formats

Index Maintenance
Full Refresh: .refresh(full)
• Slowest refresh/fastest query
• Rebuilds the entire index
Incremental Refresh: .refresh(incremental)
• Slow refresh/fast query
• Builds index on newly added files/partitions
• Drops rows from index immediately
[Figure: data files 1 2 3 4 5 with index files 1 2 3 serving a query; after data files 6 7 8 9 10 arrive, .refresh(full) and .refresh(incremental) each yield index files 4 5 6]

Index Maintenance
Quick Refresh: .refresh(incremental) – Only updates meta-data
• Fastest refresh/fast query
• Captures meta-data for appended files and file/partition predicates for deletes
• Leverages Hybrid Scan at runtime
[Figure: data files 1 2 3 4 5 and 6 7 8 9 10 with index files 1 2 3 serving a query]
Model Assumptions:
• Appends and deletes are done at file or partition level, i.e., in-place updates are not supported
• Index was constructed with lineage information
Query Processing: Hybrid Scan
[Figure: timeline showing the initial dataset created at v1 (files 1 2 3 4 5 6), index I1 created, the dataset updated to v2, and a query on the dataset at v2]
What does Hybrid Scan entail?
Step 1: Compute diff since indexed
[
{4, deleted},
{7, added},
{8, added}
]
Step 2: Rewrite Table Scan as Hybrid Scan
[Figure: query plan over Table A and Table B; the scan of Table B is rewritten as a union (∪) of σ(file != 4) over the Index Scan using I1 as of Dataset v1 and a Shuffle over appended files 7 8]
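
The two steps can be sketched in plain Scala over in-memory collections. This is a minimal model, assuming that every index row carries the source file it came from (the lineage mentioned in the model assumptions); HybridScan, IndexRow and readSourceFile are hypothetical names, and the shuffle that aligns appended rows with the index layout is left out.

```scala
// Toy model of Hybrid Scan over in-memory collections; not Hyperspace code.
object HybridScan {
  sealed trait Change
  case object Added extends Change
  case object Deleted extends Change

  // An index row with lineage: the source file the row was read from.
  final case class IndexRow(sourceFile: Int, key: String)

  // Step 1: diff between the files the index was built on and the current files.
  def diffSinceIndexed(indexed: Set[Int], current: Set[Int]): List[(Int, Change)] = {
    val deleted = indexed.diff(current).toList.sorted.map(f => (f, Deleted: Change))
    val added = current.diff(indexed).toList.sorted.map(f => (f, Added: Change))
    deleted ++ added
  }

  // Step 2: (index rows whose source file was not deleted) UNION (rows of appended files).
  def hybridScan(
      indexRows: Seq[IndexRow],
      diff: List[(Int, Change)],
      readSourceFile: Int => Seq[IndexRow]): Seq[IndexRow] = {
    val deleted = diff.collect { case (f, Deleted) => f }.toSet
    val added = diff.collect { case (f, Added) => f }
    indexRows.filterNot(r => deleted(r.sourceFile)) ++ added.flatMap(readSourceFile)
  }
}
```

For the slide's example (index built on files 1 to 6, then file 4 deleted and files 7 and 8 added), diffSinceIndexed yields the three entries listed in Step 1, and hybridScan returns rows from files 1, 2, 3, 5, 6, 7 and 8.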

Indexing Support for ACID Data Formats
Snapshot isolation
▪ Readers use a consistent snapshot of the table (no locks). All table updates are atomic.
Distributed planning
▪ File pruning and predicate push-down are distributed to jobs, removing the metastore as a bottleneck.
Version history, rollback and time travel
▪ Table snapshots are kept as history and tables can roll back if a job produces bad data.

Delta Lake Time Travel
[Figure: Delta Lake log with versions v1 to v7 (add v1, add v2, del v3, add v4, add v5, add v6, add v7) and the files in the directory at each version; labels mark the current default version and a user query against v2]
Spark Code for reading Delta Lake Table
// Reads the current default version of the table
val df = spark
  .read
  .format("delta")
  .load(deltaTablePath)
df.show()
Spark Code for Delta Lake Time Travel
// Reads the table snapshot as of version 2
val df = spark
  .read
  .format("delta")
  .option("versionAsOf", 2)
  .load(deltaTablePath)
df.show()
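
Conceptually, reading a table as of version N means replaying the add and remove actions of every commit up to N. The sketch below models that in plain Scala; it is not how the Delta Lake reader is implemented, the names DeltaLogReplay, AddFile, RemoveFile and filesAsOf are hypothetical, and the log in the usage note is made-up data rather than the one drawn on the slide.

```scala
// Toy model of replaying a table log to find the live files at a version.
object DeltaLogReplay {
  sealed trait Action
  final case class AddFile(file: Int) extends Action
  final case class RemoveFile(file: Int) extends Action

  // log maps a table version to the actions committed in that version.
  def filesAsOf(log: Map[Int, Seq[Action]], version: Int): Set[Int] =
    log.toSeq
      .filter { case (v, _) => v <= version }   // ignore commits after the requested version
      .sortBy { case (v, _) => v }              // replay commits in version order
      .flatMap { case (_, actions) => actions }
      .foldLeft(Set.empty[Int]) {
        case (files, AddFile(f))    => files + f
        case (files, RemoveFile(f)) => files - f
      }
}
```

With a hypothetical log where v1 adds files 1, 2, 3, v2 adds file 4 and v3 removes file 2, filesAsOf(log, 2) gives files 1, 2, 3, 4, while filesAsOf(log, 3) gives files 1, 3, 4.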

Indexing Support for Delta Lake Time Travel
[Figure: Hyperspace index log (create v1, refresh v2, refresh v3) shown alongside the Delta Lake log (add v1, add v2, del v3, add v4, add v5, add v6, add v7)]
User queries snapshot at v3
▪ Hyperspace chooses: Index(v1)
User queries snapshot at v4
▪ Hyperspace chooses hybrid scan over: Index(v1) + Scan(DeltaLake(v4-v3))
User queries snapshot at v6
▪ Hyperspace compares cost of hybrid scan over candidate index versions
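
One way to picture the choice among index versions is a cost comparison over the file diff between each indexed snapshot and the queried snapshot. The sketch below is an assumption-heavy illustration, not the cost model Hyperspace uses: cost is simply the number of differing files, ties go to the older index version, and IndexVersionChoice, diffCost and choose are hypothetical names.

```scala
// Toy model: pick the index version whose indexed snapshot is closest to the queried one.
object IndexVersionChoice {
  // Hypothetical cost: files to filter out of the index plus files to scan from the source.
  def diffCost(indexed: Set[Int], queried: Set[Int]): Int =
    indexed.diff(queried).size + queried.diff(indexed).size

  // indexVersions: index log version -> table version the index was built or refreshed on.
  // filesAt: table version -> files in that snapshot.
  // Returns (chosen index version, cost); ties are broken toward the lower index version.
  def choose(
      indexVersions: Map[Int, Int],
      filesAt: Map[Int, Set[Int]],
      queriedVersion: Int): Option[(Int, Int)] =
    if (indexVersions.isEmpty) None
    else Some(
      indexVersions.toSeq
        .map { case (iv, tv) => (iv, diffCost(filesAt(tv), filesAt(queriedVersion))) }
        .minBy { case (iv, cost) => (cost, iv) })
}
```

A cost of 0 means the index matches the queried snapshot exactly and can be used as is; any other cost corresponds to a hybrid scan over the chosen index plus the differing files.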

offers the best offering of Hyperspace’s indexing yet!
• No additional JAR includes
• Fastest access to latest features
• Support for Scala | Python | .NET
• Seamless integration with the UI
• Meta-store integration
• Notebooks for faster iterations

Experience of Using
Notebook: https://aka.ms/hyperspace-for-delta-lake

Preliminary Performance Evaluation of Hyperspace for Delta Lake
Compute Configuration:
• VM Instance = Azure E8 V3
• Workers/Executors = 7
• Cores per executor = 8
• Executor memory = 47 GB
• Autoscale disabled
• ADLS Gen v2
Experimental Setting
• TPC-DS store_sales: files 1 … 200, 1 GB per file, 200 GB total
• TPC-DS items: files 1 … 200, 200 KB per file, 40 MB total
• Performance of Hyperspace using TPC-DS Q44
• Append more data: files 201 … 250
Performance of Hyperspace
1. Without refreshing the index
2. After refreshing the index

Performance Implications of Using

Conclusion
Open Sourcing Hyperspace v0.1
▪ New extensible indexing subsystem for Apache Spark
▪ Simply add on: no core changes needed
▪ Same technology that powers the indexing engine inside Azure Synapse Analytics
▪ Works out-of-box with open-source Apache Spark
▪ Scala, Python, and .NET support
▪ Accelerated performance on key workloads
▪ Up to 10x query performance improvement
▪ Open sourced: https://github.com/microsoft/hyperspace
It is not perfect… but fully open to contributions towards being made perfect! ☺
@RahulPotharaju
