Hyperspace for Delta Lake
Rahul Potharaju, Terry Kim, Eunjin Song
Microsoft
@RahulPotharaju

Hyperspace is a recently open-sourced (https://github.com/microsoft/hyperspace) indexing sub-system from Microsoft. The key idea behind Hyperspace is simple: Users specify the indexes they want to build. Hyperspace builds these indexes using Apache Spark, and maintains metadata in its write-ahead log that is stored in the data lake. At runtime, Hyperspace automatically selects the best index to use for a given query without requiring users to rewrite their queries. Since Hyperspace was introduced, one of the most popular asks from the Spark community was indexing support for Delta Lake. In this talk, we present our experiences in designing and implementing Hyperspace support for Delta Lake and how it can be used for accelerating queries over Delta tables. We will cover the necessary foundations behind Delta Lake’s transaction log design and how Hyperspace enables indexing support that seamlessly works with the former’s time travel queries.
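The transaction-log idea behind time travel can be pictured with a small toy model. The sketch below is not the Delta Lake API: the `Add`, `Remove`, `Commit`, and `snapshot` names and the sample log are hypothetical, and it assumes only that each table version records file-level add and remove actions, so that a snapshot at a given version is the result of replaying the log up to that version.

```scala
// Toy model of a versioned transaction log; all names here are hypothetical.
object DeltaLogSketch {
  sealed trait Action
  final case class Add(file: String) extends Action
  final case class Remove(file: String) extends Action

  // One commit per table version.
  final case class Commit(version: Int, actions: Seq[Action])

  // Replay every commit up to and including `version` to get the live file set.
  def snapshot(log: Seq[Commit], version: Int): Set[String] =
    log.sortBy(_.version)
      .takeWhile(_.version <= version)
      .flatMap(_.actions)
      .foldLeft(Set.empty[String]) {
        case (files, Add(f))    => files + f
        case (files, Remove(f)) => files - f
      }

  // Made-up sample log, used only for illustration.
  val log: Seq[Commit] = Seq(
    Commit(1, Seq(Add("f1"), Add("f2"), Add("f3"))),
    Commit(2, Seq(Add("f4"))),
    Commit(3, Seq(Remove("f4"))),
    Commit(4, Seq(Add("f5")))
  )

  def main(args: Array[String]): Unit = {
    // Reading "as of" an older version only needs a shorter replay.
    println(snapshot(log, 2).toSeq.sorted.mkString(", "))
    println(snapshot(log, 4).toSeq.sorted.mkString(", "))
  }
}
```

Because a removed file only disappears from later snapshots, an index built against an older version can still be matched against the exact file set it covered.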

Hyperspace for Delta Lake

  1. Hyperspace for Delta Lake. Rahul Potharaju, Terry Kim, Eunjin Song. Microsoft. @RahulPotharaju
  2. Who?
  3. Rahul Potharaju: Principal Software Engineering Manager @Microsoft. Part of the Spark team at Microsoft. Azure Synapse Analytics. OSS: Hyperspace, .NET for Apache Spark. You can also find me publishing in VLDB, NSDI etc.
     Terry Kim: Principal Software Engineer @Microsoft. Part of the Spark team at Microsoft. Azure Synapse Analytics. OSS: Hyperspace, .NET for Apache Spark, Apache Spark.
     Eunjin Song: Senior Software Engineer @Microsoft. Part of the Spark team at Microsoft. Azure Synapse Analytics. OSS: Hyperspace.
  4. We work on everything Apache Spark™: Spark Runtime, Spark Service, HW Acc in Synapse, Debugging & Diagnostics
     Offer Apache Spark™-as-a-service to Microsoft customers: Runtimes for Synapse Spark, HDInsight Spark, Spark on Cosmos
     Contribute back to Apache Spark™: Spark SQL, Datasource v2, #47/1600 Spark contributor
     We open source our work! Hyperspace, .NET for Spark
  5. Agenda
     ▪ Rahul Potharaju: Background, Concepts, Conclusion
     ▪ Terry Kim: Demo, Performance Deep-dive
  6. What is Hyperspace?
  7. Hyperspace in a Nutshell
     Highlights:
     • New extensible indexing subsystem for Apache Spark
     • Same technology that powers the indexing engine inside Azure Synapse Analytics
     • Works out-of-box with open-source Apache Spark
     • Accelerated performance on key workloads
     Language Choices: Scala, Python, .NET
     Simple Usage API:
         // Index Maintenance
         createIndex(df: DataFrame, indexCfg: IndexConfig): Unit
         deleteIndex(indexName: String): Unit
         restoreIndex(indexName: String): Unit
         vacuumIndex(indexName: String): Unit
         refreshIndex(indexName: String): Unit
         cancel(indexName: String): Unit
  8. Hyperspace Use Cases
     • High-Concurrency Interactive Analytics and Data Export
     • Indexing Privacy Attributes for GDPR Compliance
     • Time-series Analytics
     • Framework for Derived Dataset Maintenance
     • Needle-in-a-haystack Queries
  9. Open-Sourced @Spark+AI Summit 2020
     • 16 Project Contributors
     • 213 Pull Requests Merged
     • 263 GitHub Stars
     • 180 Issues Reported
     • Up to 10x query performance improvement
     Links: https://aka.ms/HyperspaceIntroTalk, https://aka.ms/Hyperspace, https://aka.ms/Hyperspace-Blog, https://github.com/microsoft/hyperspace
  10. Top User Request from Spark+AI Summit 2020 & Microsoft Customers: “Will Hyperspace work for Delta Lake?”
      Yes – giving back to the community like this makes me believe Microsoft is a very different company than twenty years ago. Definitely, a culture I would enjoy! Thank you!!! Great presentation. Just finished adding our own secondary indexing schema for training selection. Lots of common threads here. Good to see Microsoft contributing to the community with such awesome work. Kudos to the team. Thank you, quite interesting Great stuff. Thanks! Awesome work! Great stuff guys! This is cool! Very interesting presentation!
  11. Hyperspace for Delta Lake: Index Maintenance, Hybrid Scan, ACID Data Formats
  12. Index Maintenance
      • Full Refresh, .refresh(full): Slowest refresh/fastest query. Rebuilds the entire index.
      • Incremental Refresh, .refresh(incremental): Slow refresh/fast query. Builds index on newly added files/partitions. Drops rows from index immediately.
      • Quick Refresh (slide label: .refresh(incremental) – Only updates meta-data): Fastest refresh/fast query. Captures meta-data for appended and file/partition predicates for deletes. Leverage Hybrid Scan at runtime.
      Model Assumptions: Appends and Deletes are done at file or partition level, i.e., in-place updates are not supported. Index was constructed with lineage information.
      [Diagrams: for each refresh mode, data files 1 to 10, the index, and the query path.]
  13. Query Processing: Hybrid Scan
      Timeline: Initial Dataset Created at v1 (files 1 2 3 4 5 6); Index Created (I1); Dataset Updated to v2 (files 1 2 X 5 6 7 8 on the slide); Query on Dataset at v2.
      What does Hybrid Scan entail?
      Step 1: Compute diff since indexed: [ {4, deleted}, {7, added}, {8, added} ]
      Step 2: Rewrite Table Scan as Hybrid Scan: Index Scan using I1 as of Dataset v1 with σ(file != 4), ∪ files 7 8 (Shuffle).
      [Plan diagram: Table A, Table B, and Table B rewritten as the union above.]
  14. Indexing Support for ACID Data Formats
      • Snapshot isolation: Readers use a consistent snapshot of the table (no locks). All table updates are atomic.
      • Distributed planning: File pruning and predicate push-down is distributed to jobs, removing the metastore as a bottleneck.
      • Version history, rollback and time travel: Table snapshots are kept as history and tables can roll back if a job produces bad data.
  15. Delta Lake Time Travel
      Delta Lake: add v1, add v2, del v3, add v4, add v5, add v6, add v7
      Files in the directory (as listed on the slide): 1 2 3 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4 5 | 1 2 3 4 5 6 | 1 2 3 4 5 6 | 1 2 3 4 5 6
      Spark Code for reading Delta Lake Table (current default version):
          val df = spark
            .read
            .format("delta")
            .load(deltaTablePath)
          df.show()
      Spark Code for Delta Lake Time Travel (user queries v2):
          val df = spark
            .read
            .format("delta")
            .option("versionAsOf", 2)
            .load(deltaTablePath)
          df.show()
  16. Indexing Support for Delta Lake Time Travel
      Delta Lake: add v1, add v2, del v3, add v4, add v5, add v6, add v7
      Hyperspace: create v1, refresh v2, refresh v3
      • User queries snapshot at v3. Hyperspace chooses: Index(v1)
      • User queries snapshot at v4. Hyperspace chooses hybrid scan over: Index(v1) + Scan(DeltaLake(v4-v3))
      • User queries snapshot at v6. Hyperspace compares cost of hybrid scan over: Index(v2) + Scan(DeltaLake(v6-v5)); Index(v3) + Scan(DeltaLake(v6-v7))
  17. Azure Synapse Analytics provides the best offering of Hyperspace’s indexing yet!
      • No additional JAR includes
      • Fastest access to latest features
      • Support for Scala | Python | .NET
      • Seamless integration with the UI
      • Meta-store integration
      • Notebooks for faster iterations
  18. Experience of Using Hyperspace for Delta Lake. Notebook: https://aka.ms/hyperspace-for-delta-lake
  19. Preliminary Performance Evaluation of Hyperspace for Delta Lake
      Compute Configuration:
      • VM Instance = Azure E8 V3
      • Workers/Executors = 7
      • Cores per executors = 8
      • Executor memory = 47 GB
      • Autoscale disabled
      • ADLS Gen v2
      Experimental Setting:
      • TPC-DS: store_sales: files 1 … 200, 1 GB each, 200 GB
      • TPC-DS: items: files 1 … 200, 200 KB each, 40 MB
      • Append more data: files 201 … 250
      Performance of Hyperspace using TPC-DS Q44:
      1. Without refreshing the index
      2. After refreshing the index
  20. Performance Implications of Using Hyperspace for Delta Lake
  21. Conclusion: Open Sourcing Hyperspace v0.1
      • New extensible indexing subsystem for Apache Spark. Simply add on: no core changes needed.
      • Same technology that powers the indexing engine inside Azure Synapse Analytics
      • Works out-of-box with open-source Apache Spark. Scala, Python, and .NET support.
      • Accelerated performance on key workloads. Up to 10x query performance improvement.
      • Open Sourced: https://github.com/microsoft/hyperspace
      It is not perfect… but fully open to contributions towards being made perfect! ☺
      @RahulPotharaju
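The two Hybrid Scan steps from slide 13 and the index choice from slide 16 can be sketched in plain Scala. This is a toy model under stated assumptions, not Hyperspace's implementation: `Diff`, `diff`, `describePlan`, and `choose` are hypothetical names, files are plain integers, and the cost of a candidate index is assumed to be the number of files that differ from the queried snapshot (the slides only say that Hyperspace compares costs).

```scala
// Toy model of Hybrid Scan planning; all names and the cost model are assumptions.
object HybridScanSketch {
  // Files that differ between the snapshot an index covers and the queried snapshot.
  final case class Diff(appended: Set[Int], deleted: Set[Int]) {
    def size: Int = appended.size + deleted.size
  }

  // Step 1: compute the diff since the data was indexed.
  def diff(indexed: Set[Int], current: Set[Int]): Diff =
    Diff(appended = current -- indexed, deleted = indexed -- current)

  private def fmt(files: Set[Int]): String =
    files.toSeq.sorted.mkString("{", ",", "}")

  // Step 2: describe the rewritten scan. Rows from deleted files are filtered
  // out of the index scan, and appended files are scanned and unioned in.
  def describePlan(indexName: String, d: Diff): String = {
    val indexSide =
      if (d.deleted.isEmpty) s"IndexScan($indexName)"
      else s"Filter(file not in ${fmt(d.deleted)}, IndexScan($indexName))"
    if (d.appended.isEmpty) indexSide
    else s"Union($indexSide, Scan(files ${fmt(d.appended)}))"
  }

  // Pick the candidate index whose snapshot differs least from the queried one.
  // Ties are broken by name so the result is deterministic.
  def choose(indexes: Map[String, Set[Int]], queried: Set[Int]): Option[(String, Diff)] =
    indexes.toSeq
      .map { case (name, files) => name -> diff(files, queried) }
      .sortBy { case (name, d) => (d.size, name) }
      .headOption

  def main(args: Array[String]): Unit = {
    // Slide 13: index I1 covers files 1 to 6; at v2 file 4 is gone and 7, 8 are new.
    val d = diff(Set(1, 2, 3, 4, 5, 6), Set(1, 2, 3, 5, 6, 7, 8))
    println(describePlan("I1", d))
  }
}
```

An exact match between an index snapshot and the queried snapshot yields an empty diff, in which case the plan degenerates to a plain index scan, as in the "queries snapshot at v3" case on slide 16.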