Hyperspace: An Indexing Subsystem for Apache Spark

At Microsoft, we store datasets (from both internal teams and external customers) ranging from a few GBs to 100s of PBs in our data lake. Analytics on these datasets range from traditional batch-style queries (e.g., OLAP) to exploratory, ‘needle in a haystack’ queries (e.g., point lookups and summarization).


  1. 1. Hyperspace: An Indexing Subsystem for Apache Spark. Rahul Potharaju & Terry Kim, Microsoft
  2. 2. Who?
  3. 3. Rahul Potharaju: Principal Software Engineering Manager @Microsoft; part of the Spark team at Microsoft Azure Synapse Analytics; OSS: Hyperspace, .NET for Apache Spark; publishes in academic conferences, e.g., VLDB. Terry Kim: Principal Software Engineer @Microsoft; part of the Spark team at Microsoft Azure Synapse Analytics; OSS: Hyperspace, Apache Spark, .NET for Apache Spark.
  4. 4. We work on everything Spark: we offer Spark-as-a-Service to Microsoft customers, contribute back to Apache Spark, and open source our work!
  5. 5. Agenda. Rahul Potharaju: Background, Vision, Concepts, Call-for-Action, Conclusion. Terry Kim: Demo, Performance Deep-dive.
  6. 6. What is an index!?
  7. 7. In databases, an ‘index’ is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. [Illustration: the index at the back of a textbook, with entries such as “Nested-loop join 718-722”, “Normalization 67, 85-92”, and “Range query 639-640, 662-664”.]
  8. 8. Overview of Hyperspace
  9. 9. Goals of Hyperspace Indexing
      • Agnostic to Data Format: should index data in the lake in any format, including text (e.g., CSV, JSON, Parquet, ORC, Avro, etc.) and binary data (e.g., videos, audios, images, etc.).
      • Low-cost Index Metadata Management: should store all metadata on the data lake and should not assume any other service to operate correctly.
      • Multi-engine Interoperability: should make third-party engine integration (e.g., non-Spark systems) feasible, intuitive, and easy; build an index through Spark and leverage it through Synapse SQL.
      • Extensible Indexing Infrastructure: should offer mechanisms for easy pluggability of newer auxiliary data structures (related to indexing).
      • Security, Privacy & Compliance: should meet the necessary security, privacy, and compliance standards, as auxiliary structures copy the original dataset either partly or in full.
  10. 10. Vision of Hyperspace Indexing
      • User-facing Index Management APIs: allow interaction with the indexing ecosystem.
      • Query Infrastructure:
          Optimizer Extensions: making the optimizer cost- and index-aware; algorithms for index selection.
          Index Recommendation: allows index suggestions for a query/workload.
          What-If & Why-not: allows index cost-benefit analysis and explainability.
      • Indexing Infrastructure:
          Index Creation & Maintenance API: primitives for index lifecycle management (e.g., creating, refreshing, deleting), enforcing retention, purge, etc.
          Log Management API: change log for enabling engine interoperability.
          Index Specifications: layouts for enabling engine interoperability.
          Concurrency Model: primitives for optimistic concurrency.
      • Data Lake:
          Datasets: structured (e.g., parquet) and unstructured (e.g., csv, tsv).
          Index: non-clustered (columnar covering index, chunk-elimination, statistics, views).
  11. 11. Hyperspace’s Usage API in Spark
      Usage (index maintenance):
          createIndex(df: DataFrame, indexCfg: IndexConfig): Unit
          deleteIndex(indexName: String): Unit
          restoreIndex(indexName: String): Unit
          vacuumIndex(indexName: String): Unit
          rebuildIndex(indexName: String): Unit
          cancel(indexName: String): Unit
      Smarts (debugging and index recommendation):
          explain(df: DataFrame): Unit
          whatIf(workload: Array[DataFrame], indexCfg: IndexConfig): Cost
          recommend(workload: Array[DataFrame], options: RecOptions): Recommendation
      Customization (configuration for storage and query optimizer):
          hyperspace.system.path
          hyperspace.index.creation.[path | namespace]
          hyperspace.index.search.[path | namespace]
          hyperspace.index.search.disablePublicIndexes
      Language choices: Scala, Python, .NET
  12. 12. … and btw, the indexes live on the data lake!
      Filesystem Root
          /indexes/<scope = public | user | namespace>
              <index name>
                  _hyperspace_log: create (active) refresh active …
                  <index-directory-1>
                  <index-directory-2>
                  <index-directory-3>
          /path/to/data/1
              data files
  13. 13. … and index-on-the-lake provides several benefits! Index scan scales; open format index; serverless access protocol.
  14. 14. Azure Synapse Analytics offers the best Hyperspace indexing experience yet! • No additional JAR includes • Fastest access to latest features • Support for Scala | Python | .NET • Seamless integration with the UI • Meta-store integration • Notebooks for faster iterations
  15. 15. Demo: Hello Hyperspace! Notebook: https://aka.ms/hellohyperspace
  16. 16. Our first hyperspace: the covering index. Creates a “copy” of the original data in a different sort order. During optimization, reads from the index instead of the base table. Useful for eliminating shuffles and filtering predicates. Example: on a table with columns (a, b, c), SELECT b WHERE a = ‘Red’ needs a full scan (linear time). With a covering index ON a that includes b, the same query becomes a binary search (log time).
  17. 17. Our first hyperspace: the covering index. Query: SELECT b, c FROM Table A, B JOIN ON A.a = B.a, where Table A has columns (a, b, c) and Table B has columns (a, p, q).
      Without indexes: Step 1: Shuffle (data is not sorted). Step 2: Sort both sides. Step 3: Merge.
      With covering indexes: Step 1: Optimizer picks the indexes Idx A and Idx B (pre-shuffled, pre-sorted). Step 2: Merge.
      Shuffle eliminated: since shuffle is the most expensive step, this query might run faster at scale.
  18. 18. Demo: Deep-dive into Hyperspace’s Index-based Query Optimization
  19. 19. Hyperspace Performance
  20. 20. Preliminary Performance Evaluation of Hyperspace Covering Indexes. Compute Configuration: • VM Instance = Azure E8 V3 • Workers/Executors = 7 • Cores per executor = 8 • Executor memory = 47 GB • Autoscale disabled • ADLS Gen v2
      [Chart 1: Workload derived from TPC Benchmark™ H (TPC-H) (Scale Factor = 1000, Apache Spark 2.4, Parquet data). Series: Baseline and Hyperspace duration (seconds), and Gain. Queries: 1-15, 17-22. Gain labels on the chart: 1.2, 2.4, 1.4, 2.3, 1.1, 3.6, 1.3, 6.8, 5.4, 1.8, 4.5, 1.9, 1.8, 2.0, 3.6, 8.9, 1.5, 1.9, 2.1, 1.1, 3.8. No regressions, up to 9x gains.]
      [Chart 2: Workload derived from TPC Benchmark™ DS (TPC-DS), Top 20 (Scale Factor = 1000, Apache Spark 2.4, Parquet data). Series: Baseline and Hyperspace duration (seconds), and Gain. Queries: 4, 6, 11, 17, 25, 29, 37, 50, 54, 64, 78, 80, 82, 93, 14a, 14b, 23a, 23b, 24a, 24b. Gain labels on the chart: 2.5, 3.8, 2.5, 4.7, 6.1, 6.7, 3.3, 4.9, 2.9, 4.9, 2.2, 6.9, 5.6, 10.9, 1.8, 2.0, 2.3, 2.2, 3.9, 1.6. No regressions, up to 11x gains.]
  21. 21. Preliminary Performance Evaluation of Hyperspace Covering Indexes. Hyperspace acceleration: 2x on TPC-H, 1.8x on TPC-DS; up to 11x query performance improvement. Workloads derived from TPC Benchmark™ H/DS (Scale Factor = 1000, Apache Spark 2.4, Parquet data). Compute Configuration: • VM Instance = Azure E8 V3 • Workers/Executors = 7 • Cores per executor = 8 • Executor memory = 47 GB • Autoscale disabled • ADLS Gen v2
  22. 22. Open Sourcing Hyperspace v0.1 • New extensible indexing subsystem for Apache Spark • Simply add on: no core changes needed • Same technology that powers the indexing engine inside Azure Synapse Analytics • Works out of the box with open source Apache Spark • Scala, Python, and .NET support • Accelerated performance on key workloads • https://github.com/microsoft/hyperspace OR https://aka.ms/hyperspace
  23. 23. Thanks to everyone who is making this possible…
  24. 24. Let us build Hyperspace together!
      • Meta-data & Lifecycle: multi-engine interop, concurrency, support for views & stats.
      • Indexing enhancements: incremental indexing, index optimization, support for Delta Lake.
      • Optimizer enhancements: more robust index & view selection, explainability.
      • Documentation & Tutorials: best practices, gotchas, more experiments.
      • More index types: critique existing design, new designs… more on this in next slide.
      • Index Recommendation: single-query & multi-query workload-based recommendation.
  25. 25. What type of hyperspaces can we build together? In Hyperspace, “index” is used broadly to refer to a derived dataset, i.e., some auxiliary information about the underlying data that will aid in query acceleration.
      • COVERING INDEX: Creates a “copy” of the original data in a different sort order. During optimization, reads from the index instead of the base table. Useful for eliminating shuffles and filtering predicates.
      • CHUNK-ELIMINATION INDEX: Creates a “pointer” from a search key back to the original data. During optimization, performs a first lookup to obtain the pointer. Useful for finding-needle-in-the-haystack queries.
      • MATERIALIZED VIEWS: Executes a (potentially complex) query and stores the results. During optimization, entire subtrees can be rewritten. Useful when the same result is computed several times.
      • STATISTICS: Collects statistics about the underlying dataset. During optimization, can power a cost-based optimizer. Useful for join re-ordering, index/view selection, etc.
  26. 26. Conclusion: Open Sourcing Hyperspace v0.1 • New extensible indexing subsystem for Apache Spark • Simply add on: no core changes needed • Same technology that powers the indexing engine inside Azure Synapse Analytics • Works out of the box with open source Apache Spark • Scala, Python, and .NET support • Accelerated performance on key workloads • Hyperspace acceleration: 2x on TPC-H, 1.8x on TPC-DS (Scale Factor = 1000, Apache Spark 2.4, Parquet data); up to 10x query performance improvement • https://github.com/microsoft/hyperspace • Open Sourced today • It is not perfect… but that’s where we need your guidance!
  27. 27. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
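
The covering-index lookup on slide 16 can be illustrated with a small, single-machine sketch in plain Python. This is not Hyperspace code: the table contents and the helper names (`full_scan`, `build_covering_index`, `index_lookup`) are hypothetical, and the "index" here is just two sorted in-memory lists rather than files on the data lake. It only shows why a sorted copy of columns (a, b) lets `SELECT b WHERE a = 'Red'` use a binary search instead of a full scan.

```python
from bisect import bisect_left, bisect_right

# Hypothetical base table with columns (a, b, c), stored in arbitrary order.
table = [
    ("Blue", 10, "x"),
    ("Red", 20, "y"),
    ("Green", 30, "z"),
    ("Red", 40, "w"),
]


def full_scan(rows, key):
    """SELECT b WHERE a = key by touching every row (linear time)."""
    return [b for (a, b, _c) in rows if a == key]


def build_covering_index(rows):
    """Index ON a INCLUDE b: copy only (a, b), sorted by a.

    Returns two parallel lists so lookups can binary-search the keys.
    """
    pairs = sorted((a, b) for (a, b, _c) in rows)
    keys = [a for (a, _b) in pairs]
    included = [b for (_a, b) in pairs]
    return keys, included


def index_lookup(index, key):
    """SELECT b WHERE a = key via binary search on the sorted keys."""
    keys, included = index
    lo = bisect_left(keys, key)   # first position with keys[pos] >= key
    hi = bisect_right(keys, key)  # first position with keys[pos] > key
    return included[lo:hi]


index = build_covering_index(table)
scan_result = full_scan(table, "Red")
index_result = index_lookup(index, "Red")
```

Both paths return the same values of `b`; the index path never reads column `c` and locates the matching range in logarithmic time, at the cost of storing and maintaining the extra sorted copy.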
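
The join walkthrough on slide 17 can be sketched the same way. The code below is a simplified, single-process illustration with hypothetical data and function names; it does not model a distributed shuffle, which is the step the slide identifies as most expensive. It only shows that when both inputs are already sorted on the join key, as covering indexes would be, the sort step disappears and the join reduces to a merge.

```python
def merge_join(left, right):
    """Inner-join two lists of tuples that are sorted by their first field."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-side rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lk:
                for r in right[j:j_end]:
                    out.append(left[i] + r[1:])
                i += 1
            j = j_end
    return out


def join_without_index(table_a, table_b):
    # Unsorted inputs: both sides must be sorted before merging.
    # (A real engine would also shuffle rows across the cluster first.)
    return merge_join(sorted(table_a), sorted(table_b))


def join_with_index(idx_a, idx_b):
    # Indexes are assumed to be pre-sorted on the join key: merge directly.
    return merge_join(idx_a, idx_b)


# Hypothetical tables: A has columns (a, b, c), B has columns (a, p, q).
table_a = [(3, "b3", "c3"), (1, "b1", "c1"), (2, "b2", "c2")]
table_b = [(2, "p2", "q2"), (3, "p3", "q3"), (4, "p4", "q4")]

# Building the "indexes" is a one-time cost paid ahead of query time.
idx_a, idx_b = sorted(table_a), sorted(table_b)

joined_plain = join_without_index(table_a, table_b)
joined_indexed = join_with_index(idx_a, idx_b)
```

The two paths produce identical rows; the difference is when the sorting work happens. With indexes it is paid once at index-build time instead of on every query.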
