Data Distribution and Ordering for Efficient Data Source V2


More and more companies are adopting Spark 3 to benefit from enhancements and performance optimizations such as adaptive query execution and dynamic partition pruning. During this process, organizations consider migrating their data sources to the newly added Catalog API (aka Data Source V2), which provides a better way to develop reliable and efficient connectors. Unfortunately, a few limitations prevent unleashing the full potential of the Catalog API. One of them is the inability to control the distribution and ordering of incoming data, which has a profound impact on the performance of data sources.

This talk will be useful for developers and data engineers who either develop their own data sources or work with existing ones in Spark. The presentation starts with an overview of the Catalog API introduced in Spark 3, followed by its benefits and current limitations compared to the old Data Source API. The main focus is an extension to the Catalog API developed in SPARK-23889, which lets implementations control how Spark distributes and orders incoming records before passing them to the sink.

The extension allows data sources not only to reduce the memory footprint during writes but also to co-locate data for faster queries and better compression. Beyond that, the introduced API paves the way for more advanced features such as partitioned joins.

  1. Data Distribution and Ordering for Efficient Data Source V2 (Anton Okolnychyi, Data + AI Summit 2021; "This is not a contribution")
  2. Presenter
     • Apache Iceberg PMC member
     • Apache Spark contributor
     • Data Lakes at Apple
     • Open source enthusiast
  3. Agenda
     • Why V2?
     • Data distribution and ordering
     • Future work
  4. What’s wrong with V1?
  5. Reliability
     • Behavior of DataFrameWriter is not defined
       - Connectors interpret SaveMode differently
       - SaveIntoDataSourceCommand vs InsertIntoDataSourceCommand
  6. Reliability
     • Validation rules are not consistent
       - PreprocessTableCreation vs PreprocessTableInsertion
       - No schema validation for path-based tables
  7. Design choices
     • Connectors interact with internal APIs
       - SQLContext
       - RDD
       - DataFrame
  8. Extensibility
     • Hard to support new features
       - No easy way to extend PrunedFilteredScan
       - Exposing ColumnarBatch instead of Row is challenging
  9. Features
     • No Structured Streaming support
     • No multi-catalog support
     • Limited bucketed tables support
  10. What’s different in V2?
  11. Reliability
      • Predictable and reliable behavior
        - Clearly defined logical plans for all connectors
        - Consistent validation rules
        - Less delegation to connectors
  12. Design choices
      • Proper abstractions
        - Connectors interact only with InternalRow and ColumnarBatch
        - Mix-in traits for optional functionality
  13. Features
      • Multi-catalog support
      • Structured Streaming
      • Vectorization
      • Bucketed tables (in progress)
  14. Data distribution and ordering
  15. Distribution
  16. Distribution
  17. Ordering
  18. Ordering
  19. Why should I care?
  20. Impact
      • Writes
        - Control the number of generated files
        - Reduce the overall memory consumption
        - Reduce the actual writing time
  21. Unspecified distribution (slide footer on slides 21-30: © 2021 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.)
  22. Unspecified distribution
  23. Unspecified distribution
  24. Proper distribution
  25. Proper distribution
  26. Proper distribution
  27. Unspecified ordering
  28. Unspecified ordering
  29. Proper ordering
  30. Proper ordering
  31. Impact
      • Reads
        - Cluster data on write for faster reads
        - Enable efficient data skipping
  32. Impact
      • Storage footprint
        - Columnar encodings perform better on sorted data (e.g. dictionary encoding)
  33. How do connectors control this?
  34. Data Source V1
      • Connectors can apply arbitrary transformations on DataFrame
      • Built-in connectors sort data within tasks using partition columns
  35. Data Source V2
      • No way to control (SPARK-23889)
      • Severe performance issues unless explicitly handled by the user
      • Blocks migration to V2
      • Fixed in upcoming Spark 3.2
  36. Solution
  37. Use cases
      • Global sort
      • Cluster + sort within tasks
      • Local sort within tasks
      • No distribution and sort
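Each of the four use cases above boils down to a combination of a required distribution and an optional sort within each task. A minimal illustrative model of that mapping (the WritePlan class and its factory names are ours, not Spark's):

```java
// Illustrative model only: WritePlan and its factories are not Spark API.
// Each use case combines a required distribution with a within-task sort flag.
final class WritePlan {
    final String distribution;      // "ordered", "clustered", or "unspecified"
    final boolean sortWithinTasks;  // whether each task sorts its rows before writing

    WritePlan(String distribution, boolean sortWithinTasks) {
        this.distribution = distribution;
        this.sortWithinTasks = sortWithinTasks;
    }

    // Global sort: range-partition by the sort key, then sort within each task.
    static WritePlan globalSort() { return new WritePlan("ordered", true); }

    // Cluster + sort within tasks: hash-partition by clustering keys, then sort locally.
    static WritePlan clusterAndSort() { return new WritePlan("clustered", true); }

    // Local sort within tasks: leave the distribution unspecified, sort locally.
    static WritePlan localSort() { return new WritePlan("unspecified", true); }

    // No distribution and sort: rows pass through as-is.
    static WritePlan passThrough() { return new WritePlan("unspecified", false); }
}
```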
  38. API

      interface WriteBuilder {
        Write build();
      }
  39. API

      interface Write {
        BatchWrite toBatch();
        StreamingWrite toStreaming();
      }
  40. API

      interface RequiresDistributionAndOrdering extends Write {
        Distribution requiredDistribution();
        SortOrder[] requiredOrdering();
      }
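A connector opts in by returning a Write that also implements RequiresDistributionAndOrdering. The sketch below is self-contained: the interfaces are simplified local stand-ins for the real types under org.apache.spark.sql.connector (only the method shapes from the slides are reproduced), and EventLogWrite with its "date" column is a hypothetical example, not a real connector:

```java
// Simplified local stand-ins for the Spark DSv2 types; only the shapes shown
// on the slides are reproduced so the example compiles on its own.
interface Distribution {}

final class ClusteredDistribution implements Distribution {
    final String[] clusteringColumns;
    ClusteredDistribution(String... columns) { this.clusteringColumns = columns; }
}

interface SortOrder {}

final class ColumnSortOrder implements SortOrder {
    final String column;
    ColumnSortOrder(String column) { this.column = column; }
}

interface Write {}

interface RequiresDistributionAndOrdering extends Write {
    Distribution requiredDistribution();
    SortOrder[] requiredOrdering();
}

// Hypothetical connector write: ask Spark to cluster incoming rows by "date"
// and sort them by "date" within each task before they reach the sink.
final class EventLogWrite implements RequiresDistributionAndOrdering {
    @Override public Distribution requiredDistribution() {
        return new ClusteredDistribution("date");
    }
    @Override public SortOrder[] requiredOrdering() {
        return new SortOrder[] { new ColumnSortOrder("date") };
    }
}
```

With this in place, Spark can insert the required shuffle and sort into the write plan itself, instead of relying on the user to pre-arrange the data.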
  41. Distributions
      • OrderedDistribution
      • ClusteredDistribution
      • UnspecifiedDistribution
  42. SortOrder

      interface SortOrder extends Expression {
        Expression expression();
        SortDirection direction();
        NullOrdering nullOrdering();
      }
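A minimal self-contained sketch of that shape: the enums mirror the value sets the slide implies (ascending/descending, nulls first/last), while NamedSortOrder is a hypothetical stand-in for a real Spark SortOrder expression, with a column name in place of expression():

```java
// Local stand-ins for illustration; real connectors return Spark's
// org.apache.spark.sql.connector.expressions.SortOrder instances.
enum SortDirection { ASCENDING, DESCENDING }
enum NullOrdering { NULLS_FIRST, NULLS_LAST }

final class NamedSortOrder {
    final String column;            // stands in for the expression() being sorted on
    final SortDirection direction;
    final NullOrdering nullOrdering;

    NamedSortOrder(String column, SortDirection direction, NullOrdering nullOrdering) {
        this.column = column;
        this.direction = direction;
        this.nullOrdering = nullOrdering;
    }
}
```

For example, sorting by "date" ascending with nulls first would be expressed as `new NamedSortOrder("date", SortDirection.ASCENDING, NullOrdering.NULLS_FIRST)`.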
  43. Current state
      • Available and fully functional in master for batch queries
      • Structured Streaming support is in progress (SPARK-34183)
  44. Future work
      • Distribution and ordering in CREATE TABLE
      • Ability to control the number of shuffle partitions
      • Coalesce partitions during adaptive query execution
  45. Key takeaways
  46. Summary
      • Consider migrating to Data Source V2
      • Data distribution and ordering is critical at scale
  47. Feedback
      • Your feedback is important to us
      • Don’t forget to review and rate sessions
  48. Thank you!
  49. TM and © 2021 Apple Inc. All rights reserved.
