Hyperspace is an extensible indexing subsystem for Apache Spark, developed by the Spark team at Microsoft. It can improve query performance by up to 10x on key workloads and offers APIs in Scala, Python, and .NET. This talk covers index maintenance (full, incremental, and quick refresh), Hybrid Scan at query time, and indexing support for ACID data formats such as Delta Lake, including snapshot isolation and time travel. Use cases include indexing privacy attributes for GDPR compliance. The project is open source and open to community contributions.
Rahul Potharaju
Principal Software Engineering Manager @Microsoft
Part of the Spark team at Microsoft
Azure Synapse Analytics
OSS: Hyperspace, .NET for Apache Spark
You can also find me publishing in VLDB, NSDI etc.
Terry Kim
Principal Software Engineer @Microsoft
Part of the Spark team at Microsoft
Azure Synapse Analytics
OSS: Hyperspace, .NET for Apache Spark, Apache Spark
Eunjin Song
Senior Software Engineer @Microsoft
Part of the Spark team at Microsoft
Azure Synapse Analytics
OSS: Hyperspace
We work on everything Apache Spark™
Spark Runtime, Spark Service, HW Acc in Synapse, Debugging & Diagnostics
Offer Apache Spark™-as-a-service to Microsoft customers
Runtimes for Synapse Spark, HDInsight Spark, Spark on Cosmos
Contribute back to Apache Spark™
Spark SQL, Datasource v2, #47/1600 Spark contributor
We open source our work!
Hyperspace, .NET for Spark
Hyperspace in a Nutshell
Simple Usage API
// Index Maintenance
createIndex(df: DataFrame, indexCfg: IndexConfig): Unit
deleteIndex(indexName: String): Unit
restoreIndex(indexName: String): Unit
vacuumIndex(indexName: String): Unit
refreshIndex(indexName: String): Unit
cancel(indexName: String): Unit
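The slide lists the maintenance calls but not how they relate to each other. The following is a minimal sketch of the index lifecycle in plain Scala, assuming the soft-delete behavior described in the open-source project's documentation for early releases: deleteIndex only marks an index as deleted, restoreIndex undoes that, and vacuumIndex permanently removes a deleted index. The state and operation names here are illustrative, not Hyperspace types, and cancel is left out because the sketch does not model in-progress operations.

```scala
// Illustrative model only: these types are hypothetical and are not part of Hyperspace.
object IndexLifecycleSketch {
  sealed trait State
  case object DoesNotExist extends State
  case object Active extends State
  case object Deleted extends State // soft-deleted: data is still on disk

  sealed trait Op
  case object Create extends Op   // createIndex
  case object Delete extends Op   // deleteIndex
  case object Restore extends Op  // restoreIndex
  case object Vacuum extends Op   // vacuumIndex
  case object Refresh extends Op  // refreshIndex

  // Returns the next stable state, or an error message for a disallowed call.
  // The allowed transitions are an assumption based on early-release behavior.
  def next(state: State, op: Op): Either[String, State] = (state, op) match {
    case (DoesNotExist, Create) => Right(Active)
    case (Active, Refresh)      => Right(Active)
    case (Active, Delete)       => Right(Deleted)
    case (Deleted, Restore)     => Right(Active)
    case (Deleted, Vacuum)      => Right(DoesNotExist)
    case _                      => Left(s"$op is not allowed in state $state")
  }
}
```

Under these assumptions, an accidental deleteIndex is recoverable with restoreIndex until vacuumIndex is called.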
Language Choices
Scala
Python
.NET
Highlights
New extensible indexing subsystem for Apache Spark
Same technology that powers the indexing engine inside Azure Synapse Analytics
Works out-of-box with open-source Apache Spark
Accelerated performance on key workloads
Hyperspace Use Cases
High-Concurrency Interactive Analytics and Data Export
Indexing Privacy Attributes for GDPR Compliance
Time-series Analytics
Framework for Derived Dataset Maintenance
Needle-in-a-haystack Queries
Top User Request from Spark+AI Summit 2020 & Microsoft Customers
“Will Hyperspace work for Delta Lake?”
Audience feedback:
“Yes – giving back to the community like this makes me believe Microsoft is a very different company than twenty years ago. Definitely, a culture I would enjoy! Thank you!!!”
“Great presentation. Just finished adding our own secondary indexing schema for training selection. Lots of common threads here.”
“Good to see Microsoft contributing to the community with such awesome work. Kudos to the team.”
“Thank you, quite interesting”
“Great stuff. Thanks!”
“Awesome work!”
“Great stuff guys!”
“This is cool!”
“Very interesting presentation!”
Index Maintenance
Full Refresh
• Slowest refresh/fastest query
• Rebuilds the entire index
Incremental Refresh
• Slow refresh/fast query
• Builds index on newly added files/partitions
• Drops rows from index immediately
Quick Refresh
• Fastest refresh/fast query
• Captures metadata for appended files and file/partition predicates for deletes
• Leverages Hybrid Scan at runtime
[Figure: data files, index files, and the query path before and after .refresh(full), .refresh(incremental), and a quick refresh that only updates metadata]
Model Assumptions:
• Appends and deletes are done at file or partition level, i.e., in-place updates are not supported
• Index was constructed with lineage information
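The three refresh modes can be sketched with a toy in-memory model. This is only an illustration of the trade-off described on the slide, not Hyperspace's implementation: the source dataset is a map from a file id to the keys stored in that file, each index row carries the id of its source file (the lineage information assumed above), and all names are hypothetical.

```scala
// Toy model of the refresh modes; every name here is hypothetical.
object RefreshModesSketch {
  // Lineage: each index row remembers which source file it came from.
  final case class IndexRow(key: Int, sourceFile: Int)

  final case class Index(
      rows: Seq[IndexRow],
      indexedFiles: Set[Int],
      pendingAppended: Set[Int] = Set.empty, // metadata recorded by quick refresh
      pendingDeleted: Set[Int] = Set.empty)

  // Source dataset: file id -> keys stored in that file.
  type Source = Map[Int, Seq[Int]]

  private def indexFiles(source: Source, files: Set[Int]): Seq[IndexRow] =
    files.toSeq.sorted.flatMap(f => source(f).map(k => IndexRow(k, f)))

  // Full refresh: rebuild the entire index from the current files.
  def fullRefresh(source: Source): Index =
    Index(indexFiles(source, source.keySet), source.keySet)

  // Incremental refresh: index only appended files, and use lineage to drop
  // rows that came from deleted files.
  def incrementalRefresh(index: Index, source: Source): Index = {
    val appended = source.keySet -- index.indexedFiles
    val deleted = index.indexedFiles -- source.keySet
    val kept = index.rows.filterNot(r => deleted.contains(r.sourceFile))
    Index(kept ++ indexFiles(source, appended), source.keySet)
  }

  // Quick refresh: leave index data untouched and only record what changed,
  // so the difference can be reconciled at query time.
  def quickRefresh(index: Index, source: Source): Index =
    index.copy(
      pendingAppended = source.keySet -- index.indexedFiles,
      pendingDeleted = index.indexedFiles -- source.keySet)
}
```

In this model the quick refresh does no work proportional to the data size, which is why it relies on Hybrid Scan to reconcile the recorded changes when a query runs.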
Query Processing: Hybrid Scan
[Figure: timeline. Initial dataset created at v1; index I1 created; dataset updated to v2; query on dataset at v2]
What does Hybrid Scan entail?
[Figure: query plan over Table A and Table B, in which the scan of Table B is replaced by a union (∪) of two branches: an Index Scan using I1 as of Dataset v1 filtered by σ file != 4, and a scan of the added files 7 and 8 followed by a Shuffle]
Step 1: Compute diff since indexed
[
{4, deleted},
{7, added},
{8, added}
]
Step 2: Rewrite Table Scan as Hybrid Scan
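The two steps can be sketched in plain Scala. This is an in-memory illustration under stated assumptions, not Hyperspace's query rewriting code: files are identified by integers as on the slide, each index row records its source file, and the readFile function stands in for a scan of a newly added source file. The Shuffle from the slide's plan is omitted because nothing is partitioned in this toy model.

```scala
// Illustrative sketch; the types and function names are hypothetical.
object HybridScanSketch {
  final case class Row(key: Int, value: String, sourceFile: Int)

  sealed trait Change
  final case class Deleted(file: Int) extends Change
  final case class Added(file: Int) extends Change

  // Step 1: compare the files the index was built on with the current files.
  def computeDiff(indexedFiles: Set[Int], currentFiles: Set[Int]): Seq[Change] = {
    val deleted: Seq[Change] = (indexedFiles -- currentFiles).toSeq.sorted.map(Deleted(_))
    val added: Seq[Change] = (currentFiles -- indexedFiles).toSeq.sorted.map(Added(_))
    deleted ++ added
  }

  // Step 2: union of (index rows not originating from deleted files) and
  // (rows read directly from the added files).
  def hybridScan(
      indexRows: Seq[Row],
      diff: Seq[Change],
      readFile: Int => Seq[Row]): Seq[Row] = {
    val deletedFiles = diff.collect { case Deleted(f) => f }.toSet
    val addedFiles = diff.collect { case Added(f) => f }
    val fromIndex = indexRows.filterNot(r => deletedFiles.contains(r.sourceFile))
    val fromNewFiles = addedFiles.flatMap(readFile)
    fromIndex ++ fromNewFiles
  }
}
```

With an index built over files 1 to 6 and a dataset that now holds files 1, 2, 3, 5, 6, 7, and 8, computeDiff yields the same three entries as the slide: file 4 deleted, files 7 and 8 added.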
Indexing Support for ACID Data Formats
Snapshot isolation
Readers use a consistent snapshot of the table (no locks). All table updates are atomic.
Distributed planning
File pruning and predicate push-down are distributed to jobs, removing the metastore as a bottleneck.
Version history, rollback and time travel
Table snapshots are kept as history and tables can roll back if a job produces bad data.
Indexing Support for Delta Lake Time Travel
[Figure: two timelines. Hyperspace index versions: create (v1), refresh (v2), refresh (v3). Delta Lake table versions: add (v1), add (v2), del (v3), add (v4), add (v5), add (v6), add (v7)]
User queries snapshot at v3
Hyperspace chooses:
- Index(v1)
User queries snapshot at v4
Hyperspace chooses hybrid scan over:
- Index(v1) + Scan(DeltaLake(v4-v3))
User queries snapshot at v6
Hyperspace compares cost of hybrid scan over:
- Index(v2) + Scan(DeltaLake(v6-v5))
- Index(v3) + Scan(DeltaLake(v6-v7))
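The selection logic in these three examples can be sketched as follows. Several things here are assumptions rather than facts from the talk: the mapping of index versions v1, v2, v3 to Delta Lake versions v3, v5, v7 is inferred from the expressions on the slide, the cost function (number of commits between the indexed snapshot and the queried one) is a stand-in for whatever cost model Hyperspace actually uses, and ties are broken in favor of the older index version.

```scala
// Illustrative sketch; names, cost model, and tie-breaking are assumptions.
object TimeTravelIndexChoice {
  // An index version together with the Delta Lake version it was built against.
  final case class IndexVersion(indexVersion: Int, deltaVersion: Int)

  // Hypothetical cost: how many Delta commits separate the indexed snapshot
  // from the queried snapshot. A real cost model would likely weigh data size.
  def diffCost(index: IndexVersion, queriedDeltaVersion: Int): Int =
    math.abs(queriedDeltaVersion - index.deltaVersion)

  // Candidates: the closest index version at or before the queried snapshot,
  // and the closest one after it.
  def candidates(indexes: Seq[IndexVersion], queried: Int): Seq[IndexVersion] = {
    val before = indexes.filter(_.deltaVersion <= queried).sortBy(_.deltaVersion).lastOption
    val after = indexes.filter(_.deltaVersion > queried).sortBy(_.deltaVersion).headOption
    before.toSeq ++ after.toSeq
  }

  // Pick the cheapest candidate; ties go to the older index version.
  def choose(indexes: Seq[IndexVersion], queried: Int): Option[IndexVersion] =
    candidates(indexes, queried)
      .sortBy(i => (diffCost(i, queried), i.indexVersion))
      .headOption
}
```

For a query at v3 the cost for Index(v1) is zero, so no hybrid scan is needed; for a query at v6 both neighbors are candidates, which corresponds to the cost comparison on the slide.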
Azure Synapse Analytics offers the best experience of Hyperspace's indexing yet!
• No additional JAR includes
• Fastest access to latest features
• Support for Scala | Python | .NET
• Seamless integration with the UI
• Meta-store integration
• Notebooks for faster iterations
Open Sourcing Hyperspace v0.1
Conclusion
New extensible indexing subsystem for Apache Spark
Simply add on: no core changes needed
Same technology that powers the indexing engine inside Azure Synapse Analytics
Works out-of-box with open-source Apache Spark
Scala, Python, and .NET support
Accelerated performance on key workloads
Up to 10x query performance improvement
https://github.com/microsoft/hyperspace
Open Sourced
It is not perfect… but fully open to contributions
towards being made perfect! ☺
@RahulPotharaju