Amazon’s Exabyte-Scale Migration from Spark to Ray
Patrick Ames, Amazon
Overview
1. Introduction
2. The Compaction Efficiency Problem
3. The Spark to Ray Migration
4. Results
5. Future Work
6. Get Involved
Introduction
1. From Data Warehouse to Lakehouse
2. Business Intelligence?
3. A Brief History of Amazon BI
4. Ray?
From Data Warehouse to Lakehouse
1. Data Warehouse: Coupled storage and compute over structured data
2. Data Lake: Highly scalable raw data storage and retrieval
3. Data Catalog: Data Lake + metadata (table names, versions, schemas, descriptions, audit logs, etc.)
4. Data Lakehouse: Decoupled data catalog + compute over structured and unstructured data
Business Intelligence
[Figure: Business Intelligence Flywheel (decision evolution)]
Amazon is gradually transitioning large-scale BI pipelines into phases 2 and 3.
Amazon’s BI catalog provides a centralized hub to access exabytes of business data.
A Brief History of Amazon BI
• 2016-2018: PB-Scale Oracle Data Warehouse Deprecation
  • Migrated 50PB from Oracle Data Warehouse to S3-Based Data Catalog
  • Decoupled storage with Amazon Redshift & Apache Hive on Amazon EMR Compute
• 2018-2019: EB-Scale Data Catalog & Lakehouse Formation
  • Bring your own compute (EMR Spark, AWS Glue, Amazon Redshift Spectrum, etc.)
  • LSM-based CDC “Compaction” using Apache Spark on Amazon EMR
Append-only compaction: Append deltas arrive in a table’s CDC log stream, where each delta
contains pointers to one or more S3 files containing records to insert into the table. During a
compaction job, no records are updated or deleted so the delta merge is a simple concatenation,
but the compactor is still responsible for writing out files sized appropriately to optimize reads
(i.e. merge tiny files into larger files and split massive files into smaller files).
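To illustrate the file-sizing responsibility described above, here is a minimal append-only compaction sketch using pyarrow. It is not the production DeltaCAT/Flash Compactor implementation; the delta paths, output directory, and 256 MiB target size are hypothetical choices.

```python
# Minimal append-only compaction sketch (illustrative only, not the DeltaCAT
# implementation). Assumes delta files are Parquet; `delta_paths`,
# `output_dir`, and the 256 MiB target size are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

TARGET_FILE_BYTES = 256 * 1024 * 1024  # desired compacted file size

def compact_append_only(delta_paths, output_dir):
    tables = [pq.read_table(p) for p in delta_paths]  # read all delta files
    merged = pa.concat_tables(tables)                 # simple concatenation
    # Estimate rows per output file from the average in-memory row size.
    bytes_per_row = max(1, merged.nbytes // max(1, merged.num_rows))
    rows_per_file = max(1, TARGET_FILE_BYTES // bytes_per_row)
    file_index = 0
    for offset in range(0, merged.num_rows, rows_per_file):
        chunk = merged.slice(offset, rows_per_file)   # merge/split to target size
        pq.write_table(chunk, f"{output_dir}/compacted_{file_index}.parquet")
        file_index += 1
```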
Upsert compaction: Append and Upsert deltas arrive in a table’s CDC log stream, where each
Upsert delta contains records to update or insert according to one or more merge keys. In this
case, column1 is used as the merge key, so only the latest column2 updates are kept per distinct
column1 value.
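A minimal sketch of the upsert merge semantics described above, using pandas for brevity (the production compactor performs this distributed on Ray over Arrow data). The column1/column2 names mirror the example; everything else is hypothetical.

```python
# Minimal upsert-compaction sketch (illustrative only). Deltas are applied in
# arrival order and the latest value wins per distinct merge key.
import pandas as pd

def compact_upsert(compacted: pd.DataFrame, deltas: list[pd.DataFrame],
                   merge_key: str = "column1") -> pd.DataFrame:
    # Concatenate the previously compacted table with new deltas in order,
    # then keep only the last (most recent) record per distinct merge key.
    combined = pd.concat([compacted, *deltas], ignore_index=True)
    return combined.drop_duplicates(subset=[merge_key], keep="last")

# Example: the delta's update to key "a" wins.
base = pd.DataFrame({"column1": ["a", "b"], "column2": [1, 2]})
delta = pd.DataFrame({"column1": ["a", "c"], "column2": [10, 3]})
print(compact_upsert(base, [delta]))  # a -> 10, b -> 2, c -> 3
```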
• 2019-2023: Ray Integration
  • EB-Scale Data Quality Analysis
  • Spark-to-Ray Compaction Migration
Ray Shadow Compaction Workflow: New deltas arriving in a data catalog table’s CDC log
stream are merged into two separate compacted tables maintained separately by Apache Spark
and Ray. The Data Reconciliation Service verifies that different data processing frameworks
produce equivalent results when querying datasets produced by Apache Spark and Ray, while the
Ray-based DQ Service compares key dataset statistics.
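To make the reconciliation idea concrete, here is a small hypothetical check in the spirit of comparing key dataset statistics. It is not the actual Data Reconciliation or DQ Service; it simply assumes both compactors wrote Parquet and compares a few coarse statistics of the two copies.

```python
# Sketch of a shadow-compaction check (illustrative only, not Amazon's Data
# Reconciliation or DQ services). Paths and the chosen statistics are hypothetical.
import pyarrow.parquet as pq
import pyarrow.compute as pc

def compare_compacted_tables(spark_path: str, ray_path: str, key: str) -> bool:
    spark_table = pq.read_table(spark_path)
    ray_table = pq.read_table(ray_path)
    checks = {
        "row_count": spark_table.num_rows == ray_table.num_rows,
        "distinct_keys": (pc.count_distinct(spark_table[key]).as_py()
                          == pc.count_distinct(ray_table[key]).as_py()),
        "schema_names": spark_table.schema.names == ray_table.schema.names,
    }
    for name, ok in checks.items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
    return all(checks.values())
```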
• 2024+: Ray Exclusivity
  • Migrate all Table Queries to Ray Compactor Output
  • Turn off Spark Compactor
Ray-Exclusive compaction: New deltas arriving in a data catalog table’s CDC log stream are
merged into only one compacted table maintained by Ray. Amazon BI tables are gradually being
migrated from Spark compaction to Ray-Exclusive compaction, starting with our largest tables.
Ray?
• Pythonic
• Provides distributed Python APIs for ML, data science, and general workloads.
• Intuitive
• Relatively simple to convert single-process Python to distributed (see the sketch after this list).
• Scalable
• Can integrate PB-scale datasets with data processing and ML pipelines.
• Performant
• Reduces end-to-end latency of data processing and ML workflows.
• Efficient
• Reduces end-to-end cost of data processing and ML.
• Unified
• Can run all steps of mixed data processing, data science, and ML pipelines.
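As an example of the "intuitive" claim above, the sketch below turns a single-process Python function into a distributed Ray task with a one-line decorator. The workload is a toy placeholder, not production compaction code.

```python
# Minimal sketch of converting single-process Python to distributed Ray tasks.
import ray

ray.init()  # start (or connect to) a Ray cluster

def parse_file(path: str) -> int:
    # Stand-in for real per-file work, e.g. parsing one S3 object.
    return len(path)

@ray.remote
def parse_file_distributed(path: str) -> int:
    return parse_file(path)  # same logic, now runs as a Ray task

paths = [f"s3://bucket/file_{i}" for i in range(100)]
local_results = [parse_file(p) for p in paths]                                # single process
remote_results = ray.get([parse_file_distributed.remote(p) for p in paths])  # distributed
assert local_results == remote_results
```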
The Compaction Efficiency Problem
1. Efficiency Defined
2. Understanding Amazon BI Pricing
3. The Money Pit
Efficiency Defined
• Output/Input
• Physics: (Useful Energy Output) / (Energy Input)
• Computing: (Useful Compute Output) / (Resource Input)
• In Other Words… The More Useful Work You Complete With the Same Resources, the
More Efficient You Are
• Simple Example
• System A: Reads 2GB/min of input data with 1 CPU and 8GB RAM.
• System B: Reads 4GB/min of input data with 1 CPU and 8GB RAM.
• Conclusion: System B is 2X more efficient than System A.
Understanding Amazon BI Pricing
• Consumers Pay Per Byte Consumed from our Data Catalog
• Administrative costs are divided evenly across thousands of distinct data consumers.
• Everyone is charged the same normalized amount per byte read.
• The Primary Administrative Cost is Compaction
• What does it do?
• Input is a list of structured S3 files with records to insert, update, and delete.
• Output is a set of S3 files with all inserts, updates, and deletes applied (a sketch follows this slide).
• Why is it so expensive?
• Runs after any write to active tables in our exabyte-scale data catalog.
• Incurs large compute cost to merge input, and storage cost to write output.
• Why do we do it?
• To cache a Read-Optimized View of the dataset.
• To enforce Security & Compliance Policies Like GDPR.
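A minimal sketch of that input/output contract: apply inserts, updates, and deletes from delta files to a previously compacted table. The "op" column and merge-key name are hypothetical, and the real compactor performs this distributed on Ray rather than in a single pandas process.

```python
# Sketch of the compaction contract described above (illustrative only):
# apply inserts, updates, and deletes to produce a read-optimized output.
import pandas as pd

def apply_deltas(base: pd.DataFrame, deltas: list[pd.DataFrame],
                 merge_key: str = "pk") -> pd.DataFrame:
    current = base
    for delta in deltas:  # each delta carries an "op" column: "upsert" or "delete"
        deletes = delta[delta["op"] == "delete"][merge_key]
        upserts = delta[delta["op"] == "upsert"].drop(columns=["op"])
        current = current[~current[merge_key].isin(deletes)]        # apply deletes
        current = pd.concat([current, upserts], ignore_index=True)  # apply inserts/updates...
        current = current.drop_duplicates(subset=[merge_key], keep="last")  # ...latest wins
    return current
```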
The Money Pit
• Consumers Pay Per Byte Consumed
• The Primary Administrative Cost is Compaction
💵 💵 💵 💵 💵 (compaction costs pile up with every write to the catalog)
The Spark to Ray Migration
1. Key Milestones
2. Challenges
3. Concessions Made
4. Lessons Learned
Key Milestones
1. 2019: Ray Investigation, Benchmarking, and Analysis
2. 2020: Ray Compactor (The Flash Compactor) Algorithm Design
3. 2020: The Flash Compactor Proof-of-Concept Implementation
4. 2021: Ray Job Management System Design & Implementation
5. 2022: Exabyte-Scale Production Data Quality Analysis with Ray
6. 2022: The Flash Compactor Shadows Spark in Production
7. 2023: Production Consumer Migration from Spark to The Flash Compactor on Ray
8. 2024: Ray Compactor Exclusivity - Spark Compactor Shutdown Begins
Challenges
Budgeting for Deferred Results
• Costs increase before they decrease
• Operating Ray & Spark concurrently
• Largest 1% of tables are ~50% of cost
• Migrating largest 1% of tables first
• At scale, failure resilience takes time
Correctness
• Backwards compatibility with Spark
• Spark/Arrow type system differences
• Asserting correctness on a budget
• Automating data quality workflows
Regressive Datasets
• Merging millions of KB/MB-scale files
• Splitting TB-scale files
• Dealing with oversized merge keys
• Handling corrupt and missing data
• Poor table partitioning and schema
Operational Sustainability
• Cluster management (cost and scale)
• Reducing cluster start/stop times
• Eliminating out-of-memory errors
• When to use AWS Glue vs. EC2
• Proliferation of manual workloads
Challenges
Customer Experience & Paradoxical Goals
• Goal 1: Complete an EB-scale compute framework migration invisibly
• Goal 2: Customers just read a table and shouldn’t care if Spark or Ray produced it
• Goal 3: Customers need to know how complete the table they’re consuming is
So what do you tell them when the data produced by Spark and Ray are out of sync?
🤔
Q: So how complete IS this table?
A: Well, that kind of depends on who’s asking…
We also built per-table-consumer dynamic routing so that each consumer gets the right answer (a sketch follows).
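For illustration only, here is a hypothetical sketch of what per-table-consumer routing can look like during the shadow period. The table names, consumer names, and S3 layout are invented; this is not Amazon's actual routing service.

```python
# Hypothetical per-table, per-consumer routing during shadow compaction.
# Consumers already migrated to Ray read the Ray-compacted copy; everyone
# else keeps reading the Spark-compacted copy.
RAY_MIGRATED: dict[str, set[str]] = {
    "orders": {"team-analytics", "team-ml"},  # hypothetical table -> migrated consumers
}

def resolve_table_location(table: str, consumer: str) -> str:
    if consumer in RAY_MIGRATED.get(table, set()):
        return f"s3://catalog/{table}/compacted-ray/"    # hypothetical layout
    return f"s3://catalog/{table}/compacted-spark/"

print(resolve_table_location("orders", "team-ml"))       # Ray copy
print(resolve_table_location("orders", "team-finance"))  # Spark copy
```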
Concessions Made
Backwards Compatibility over Latest Tech
• Regressed from Parquet 2.X to Spark-flavored Parquet 1.0
• Subsequent migrations needed to move away from backwards compatibility
Multiple Implementations over Unification
• No single implementation is optimal across all dataset and hardware variations
• Constantly balancing ROI of maintaining vs. discarding an implementation
Manual Memory Management over Automatic
• Difficult to predict object store memory requirements at cluster & task creation time
• High-efficiency self-managed GC preferred over fully managed GC (see the sketch after this slide)
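The manual memory management concession is easier to see in code. Below is a minimal, hypothetical sketch (not our production compactor) of the explicit controls Ray exposes: sizing the object store at startup, declaring per-task memory requirements, and dropping object references promptly so the store can evict them. All sizes and the workload are placeholders.

```python
# Sketch of manual memory controls used instead of relying purely on
# automatic management (illustrative; sizes and workload are hypothetical).
import numpy as np
import ray

# Size the object store explicitly at driver/cluster start (bytes).
ray.init(object_store_memory=4 * 1024**3)

# Declare per-task memory requirements so the scheduler packs tasks safely.
@ray.remote(memory=2 * 1024**3)
def merge_chunk(chunk: np.ndarray) -> float:
    return float(chunk.sum())

refs = [merge_chunk.remote(np.ones(10_000_000)) for _ in range(4)]
results = ray.get(refs)

# "Self-managed GC": drop references as soon as results are consumed so the
# object store can evict the underlying objects instead of waiting.
del refs
```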
Lessons Learned
Budgeting for Deferred Results
• Don’t save the hardest problems for last
• Know time to recoup initial investment
• Show you can recoup initial investment
Correctness
• Your customers define correctness
• Type systems rarely agree on equality
• Don’t expect 100% result equality
• Build key dataset equivalence checks
Regressive Datasets
• Test against production data ASAP
• Code coverage != edge-case coverage
• Tests can’t replace production telemetry
• At scale, QA and telemetry converge
Operational Sustainability
• Murphy’s Law is always true at scale
• EC2 excels at larger, >10TB jobs
• Glue excels at smaller, <10TB jobs
• Automate manual ops early and often
• Building serverless Ray infra is hard
Results
1. Ray vs. Spark Production Metrics
2. Key Production Statistics
3. Fun Facts
Key Production Statistics
Ray DeltaCAT Compactor
October 2024
Job Runs: >25K jobs/day
EC2 vCPUs Provisioned: >1.5MM vCPUs/day
Data Merged: >40 PB/day (pyarrow)
Cost Efficiency: $2.74/TB (on-demand r6g avg), $0.59/TB (spot r6g avg)
Projected Annual Savings: >220K EC2 vCPU Years, ~$100MM/year (on-demand r6g)
Fun Facts
Did you know…
1. All of our internal production Data Catalog API calls go from Python to Java?
2. We have a critical prod dependency on Py4j, and it hasn’t caused a single issue yet?
But we spent A LOT of time stabilizing this Py4j dependency before going to prod! (A minimal Py4j sketch follows below.)
3. Our production compactor code is checked into the open source DeltaCAT project?
And we’ve started testing it on Iceberg!
So if you find any issues or would like to help, please let us know!
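For context, this is roughly what Py4j usage looks like: a Python process calling into a JVM through a GatewayServer. It is the canonical Py4j pattern from its documentation, not our actual Data Catalog client code.

```python
# Minimal Py4j sketch (illustrative). Assumes a Py4j GatewayServer is already
# running inside the JVM on the default port.
from py4j.java_gateway import JavaGateway

gateway = JavaGateway()                  # connect to the JVM's GatewayServer
random = gateway.jvm.java.util.Random()  # instantiate a Java object from Python
print(random.nextInt(100))               # invoke a Java method from Python
```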
Conclusion
1. Spark to Ray?
2. Key Takeaways
Spark to Ray?
Should I start migrating my Spark jobs to Ray?
• Spark still provides more general and feature-rich data processing abstractions.
• There are no paved roads to translate all Spark applications to Ray-native equivalents.
• Don’t expect comparable improvements by just running your Spark jobs on RayDP.
Probably Not
(not yet, at least)
Key Takeaways
So What Do These Results Mean?
• The flexibility of Ray Core lets you craft optimal solutions to very specific problems.
• You may want to focus on rewriting your most expensive distributed compute jobs on Ray.
• Our compactor provides one specific example of how a targeted migration to Ray can pay off.
• The Exoshuffle paper shows a more general example of how Ray can improve data processing (and Ray on EC2 still holds the 100TB Cloud Sort cost record of $97!).
Ray has the potential to be a world-class big data processing
framework, but realizing that potential takes A LOT of work today!
Future Work
Apache Iceberg Compaction on DeltaCAT
https://github.com/ray-project/deltacat/
https://github.com/apache/iceberg
Apache Iceberg Compaction on DeltaCAT
→ What Will it Do?
+ Improve Iceberg table copy-on-write / merge-on-read efficiency & scalability.
+ Improve reading Iceberg equality deletes written by Apache Flink.
→ When Can I Use It?
+ Targeting a stable open source release in early 2025.
+ Currently running internal tests to verify correctness, stability, efficiency, etc.
→ What Can I Do Today?
+ Run local and distributed Iceberg table reads and writes via Daft on Ray.
+ https://www.getdaft.io/
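For example, a minimal Daft-on-Ray Iceberg read might look like the sketch below. The catalog and table names are hypothetical, and the exact Daft/PyIceberg APIs may vary by version.

```python
# Minimal sketch of reading an Iceberg table with Daft on Ray today
# (https://www.getdaft.io/). Catalog/table names and connection details
# are hypothetical.
import daft
from pyiceberg.catalog import load_catalog

daft.context.set_runner_ray()                    # run Daft queries on a Ray cluster

catalog = load_catalog("my_catalog")             # hypothetical PyIceberg catalog
iceberg_table = catalog.load_table("db.events")  # hypothetical table identifier

df = daft.read_iceberg(iceberg_table)            # distributed Iceberg scan
df.where(df["event_type"] == "click").show()
```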
Thank You
Ray Community Slack
@Patrick Ames
DeltaCAT Project Homepage
https://github.com/ray-project/deltacat
Read More on the AWS Open Source Blog
https://aws.amazon.com/blogs/opensource/
