Amazon’s Exabyte-Scale Migration from Spark to Ray
Patrick Ames, Amazon
Overview
1. Introduction
2. The Compaction Efficiency Problem
3. The Spark to Ray Migration
4. Results
5. Future Work
6. Get Involved
Introduction
1. From Data Warehouse to Lakehouse
2. Business Intelligence?
3. A Brief History of Amazon BI
4. Ray?
From Data Warehouse to Lakehouse
1. Data Warehouse: Coupled storage and compute over structured data
2. Data Lake: Highly scalable raw data storage and retrieval
3. Data Catalog: Data Lake + metadata (table names, versions, schemas, descriptions, audit logs, etc.)
4. Data Lakehouse: Decoupled data catalog + compute over structured and unstructured data
Business Intelligence
[Figure: Business Intelligence Flywheel (decision evolution)]
Amazon is gradually transitioning large-scale BI pipelines into phases 2 and 3.
Amazon’s BI catalog provides a centralized hub to access exabytes of business data.
A Brief History of Amazon BI
• 2016-2018: PB-Scale Oracle Data Warehouse Deprecation
  • Migrated 50PB from Oracle Data Warehouse to S3-Based Data Catalog
  • Decoupled storage with Amazon Redshift & Apache Hive on Amazon EMR Compute
• 2018-2019: EB-Scale Data Catalog & Lakehouse Formation
  • Bring your own compute (EMR Spark, AWS Glue, Amazon Redshift Spectrum, etc.)
  • LSM-based CDC “Compaction” using Apache Spark on Amazon EMR
Append-only compaction: Append deltas arrive in a table’s CDC log stream, where each delta
contains pointers to one or more S3 files containing records to insert into the table. During a
compaction job, no records are updated or deleted so the delta merge is a simple concatenation,
but the compactor is still responsible for writing out files sized appropriately to optimize reads
(i.e. merge tiny files into larger files and split massive files into smaller files).
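To illustrate the file-sizing responsibility described above, here is a minimal append-only compaction sketch using pyarrow. It is not the production DeltaCAT/Flash Compactor implementation; the delta paths, output directory, and 256 MiB target size are hypothetical choices.

```python
# Minimal append-only compaction sketch (illustrative only, not the DeltaCAT
# implementation). Assumes delta files are Parquet; `delta_paths`,
# `output_dir`, and the 256 MiB target size are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

TARGET_FILE_BYTES = 256 * 1024 * 1024  # desired compacted file size

def compact_append_only(delta_paths, output_dir):
    tables = [pq.read_table(p) for p in delta_paths]  # read all delta files
    merged = pa.concat_tables(tables)                 # simple concatenation
    # Estimate rows per output file from the average in-memory row size.
    bytes_per_row = max(1, merged.nbytes // max(1, merged.num_rows))
    rows_per_file = max(1, TARGET_FILE_BYTES // bytes_per_row)
    file_index = 0
    for offset in range(0, merged.num_rows, rows_per_file):
        chunk = merged.slice(offset, rows_per_file)   # merge/split to target size
        pq.write_table(chunk, f"{output_dir}/compacted_{file_index}.parquet")
        file_index += 1
```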
Upsert compaction: Append and Upsert deltas arrive in a table’s CDC log stream, where each
Upsert delta contains records to update or insert according to one or more merge keys. In this
case, column1 is used as the merge key, so only the latest column2 updates are kept per distinct
column1 value.
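A minimal sketch of the upsert merge semantics described above, using pandas for brevity (the production compactor performs this distributed on Ray over Arrow data). The column1/column2 names mirror the example; everything else is hypothetical.

```python
# Minimal upsert-compaction sketch (illustrative only). Deltas are applied in
# arrival order and the latest value wins per distinct merge key.
import pandas as pd

def compact_upsert(compacted: pd.DataFrame, deltas: list[pd.DataFrame],
                   merge_key: str = "column1") -> pd.DataFrame:
    # Concatenate the previously compacted table with new deltas in order,
    # then keep only the last (most recent) record per distinct merge key.
    combined = pd.concat([compacted, *deltas], ignore_index=True)
    return combined.drop_duplicates(subset=[merge_key], keep="last")

# Example: the delta's update to key "a" wins.
base = pd.DataFrame({"column1": ["a", "b"], "column2": [1, 2]})
delta = pd.DataFrame({"column1": ["a", "c"], "column2": [10, 3]})
print(compact_upsert(base, [delta]))  # a -> 10, b -> 2, c -> 3
```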
• 2019-2023: Ray Integration
  • EB-Scale Data Quality Analysis
  • Spark-to-Ray Compaction Migration
Ray Shadow Compaction Workflow: New deltas arriving in a data catalog table’s CDC log
stream are merged into two separate compacted tables maintained separately by Apache Spark
and Ray. The Data Reconciliation Service verifies that different data processing frameworks
produce equivalent results when querying datasets produced by Apache Spark and Ray, while the
Ray-based DQ Service compares key dataset statistics.
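To make the reconciliation idea concrete, here is a small hypothetical check in the spirit of comparing key dataset statistics. It is not the actual Data Reconciliation or DQ Service; it simply assumes both compactors wrote Parquet and compares a few coarse statistics of the two copies.

```python
# Sketch of a shadow-compaction check (illustrative only, not Amazon's Data
# Reconciliation or DQ services). Paths and the chosen statistics are hypothetical.
import pyarrow.parquet as pq
import pyarrow.compute as pc

def compare_compacted_tables(spark_path: str, ray_path: str, key: str) -> bool:
    spark_table = pq.read_table(spark_path)
    ray_table = pq.read_table(ray_path)
    checks = {
        "row_count": spark_table.num_rows == ray_table.num_rows,
        "distinct_keys": (pc.count_distinct(spark_table[key]).as_py()
                          == pc.count_distinct(ray_table[key]).as_py()),
        "schema_names": spark_table.schema.names == ray_table.schema.names,
    }
    for name, ok in checks.items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
    return all(checks.values())
```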
• 2024+: Ray Exclusivity
  • Migrate all Table Queries to Ray Compactor Output
  • Turn off Spark Compactor
Ray-Exclusive compaction: New deltas arriving in a data catalog table’s CDC log stream are
merged into only one compacted table maintained by Ray. Amazon BI tables are gradually being
migrated from Spark compaction to Ray-Exclusive compaction, starting with our largest tables.
Ray?
• Pythonic
• Provides distributed Python APIs for ML, data science, and general workloads.
• Intuitive
• Relatively simple to convert single-process Python to distributed (see the sketch after this list).
• Scalable
• Can integrate PB-scale datasets with data processing and ML pipelines.
• Performant
• Reduces end-to-end latency of data processing and ML workflows.
• Efficient
• Reduces end-to-end cost of data processing and ML.
• Unified
• Can run all steps of mixed data processing, data science, and ML pipelines.
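As an example of the "intuitive" claim above, the sketch below turns a single-process Python function into a distributed Ray task with a one-line decorator. The workload is a toy placeholder, not production compaction code.

```python
# Minimal sketch of converting single-process Python to distributed Ray tasks.
import ray

ray.init()  # start (or connect to) a Ray cluster

def parse_file(path: str) -> int:
    # Stand-in for real per-file work, e.g. parsing one S3 object.
    return len(path)

@ray.remote
def parse_file_distributed(path: str) -> int:
    return parse_file(path)  # same logic, now runs as a Ray task

paths = [f"s3://bucket/file_{i}" for i in range(100)]
local_results = [parse_file(p) for p in paths]                                # single process
remote_results = ray.get([parse_file_distributed.remote(p) for p in paths])  # distributed
assert local_results == remote_results
```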
The Compaction Efficiency Problem
1. Efficiency Defined
2. Understanding Amazon BI Pricing
3. The Money Pit
Efficiency Defined
• Output/Input
• Physics: (Useful Energy Output) / (Energy Input)
• Computing: (Useful Compute Output) / (Resource Input)
• In Other Words… The More Useful Work You Complete With the Same Resources, the
More Efficient You Are
• Simple Example
• System A: Reads 2GB/min of input data with 1 CPU and 8GB RAM.
• System B: Reads 4GB/min of input data with 1 CPU and 8GB RAM.
• Conclusion: System B is 2X more efficient than System A.
Understanding Amazon BI Pricing
• Consumers Pay Per Byte Consumed from our Data Catalog
• Administrative costs are divided evenly across thousands of distinct data consumers.
• Everyone is charged the same normalized amount per byte read.
• The Primary Administrative Cost is Compaction
• What does it do?
• Input is a list of structured S3 files with records to insert, update, and delete.
• Output is a set of S3 files with all inserts, updates, and deletes applied (a sketch follows this slide).
• Why is it so expensive?
• Runs after any write to active tables in our exabyte-scale data catalog.
• Incurs large compute cost to merge input, and storage cost to write output.
• Why do we do it?
• To cache a Read-Optimized View of the dataset.
• To enforce Security & Compliance Policies Like GDPR.
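A minimal sketch of that input/output contract: apply inserts, updates, and deletes from delta files to a previously compacted table. The "op" column and merge-key name are hypothetical, and the real compactor performs this distributed on Ray rather than in a single pandas process.

```python
# Sketch of the compaction contract described above (illustrative only):
# apply inserts, updates, and deletes to produce a read-optimized output.
import pandas as pd

def apply_deltas(base: pd.DataFrame, deltas: list[pd.DataFrame],
                 merge_key: str = "pk") -> pd.DataFrame:
    current = base
    for delta in deltas:  # each delta carries an "op" column: "upsert" or "delete"
        deletes = delta[delta["op"] == "delete"][merge_key]
        upserts = delta[delta["op"] == "upsert"].drop(columns=["op"])
        current = current[~current[merge_key].isin(deletes)]        # apply deletes
        current = pd.concat([current, upserts], ignore_index=True)  # apply inserts/updates...
        current = current.drop_duplicates(subset=[merge_key], keep="last")  # ...latest wins
    return current
```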
The Money Pit
• Consumers Pay Per Byte Consumed
• The Primary Administrative Cost is Compaction
💵 💵 💵 💵 💵 (compaction costs pile up with every write to the catalog)
The Spark to Ray Migration
1. Key Milestones
2. Challenges
3. Concessions Made
4. Lessons Learned
Key Milestones
1. 2019: Ray Investigation, Benchmarking, and Analysis
2. 2020: Ray Compactor (The Flash Compactor) Algorithm Design
3. 2020: The Flash Compactor Proof-of-Concept Implementation
4. 2021: Ray Job Management System Design & Implementation
5. 2022: Exabyte-Scale Production Data Quality Analysis with Ray
6. 2022: The Flash Compactor Shadows Spark in Production
7. 2023: Production Consumer Migration from Spark to The Flash Compactor on Ray
8. 2024: Ray Compactor Exclusivity - Spark Compactor Shutdown Begins
Challenges
Budgeting for Deferred Results
• Costs increase before they decrease
• Operating Ray & Spark concurrently
• Largest 1% of tables are ~50% of cost
• Migrating largest 1% of tables first
• At scale, failure resilience takes time
Correctness
• Backwards compatibility with Spark
• Spark/Arrow type system differences
• Asserting correctness on a budget
• Automating data quality workflows
Regressive Datasets
• Merging millions of KB/MB-scale files
• Splitting TB-scale files
• Dealing with oversized merge keys
• Handling corrupt and missing data
• Poor table partitioning and schema
Operational Sustainability
• Cluster management (cost and scale)
• Reducing cluster start/stop times
• Eliminating out-of-memory errors
• When to use AWS Glue vs. EC2
• Proliferation of manual workloads
Challenges
Customer Experience & Paradoxical Goals
• Goal 1: Complete an EB-scale compute framework migration invisibly
• Goal 2: Customers just read a table and shouldn’t care if Spark or Ray produced it
• Goal 3: Customers need to know how complete the table they’re consuming is
So what do you tell them when the data produced by Spark and Ray are out of sync?
🤔
Q: So how complete IS this table?
A: Well, that kind of depends on who’s asking…
We also built per-table-consumer dynamic routing so that each consumer gets the right answer (a sketch follows).
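For illustration only, here is a hypothetical sketch of what per-table-consumer routing can look like during the shadow period. The table names, consumer names, and S3 layout are invented; this is not Amazon's actual routing service.

```python
# Hypothetical per-table, per-consumer routing during shadow compaction.
# Consumers already migrated to Ray read the Ray-compacted copy; everyone
# else keeps reading the Spark-compacted copy.
RAY_MIGRATED: dict[str, set[str]] = {
    "orders": {"team-analytics", "team-ml"},  # hypothetical table -> migrated consumers
}

def resolve_table_location(table: str, consumer: str) -> str:
    if consumer in RAY_MIGRATED.get(table, set()):
        return f"s3://catalog/{table}/compacted-ray/"    # hypothetical layout
    return f"s3://catalog/{table}/compacted-spark/"

print(resolve_table_location("orders", "team-ml"))       # Ray copy
print(resolve_table_location("orders", "team-finance"))  # Spark copy
```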
Concessions Made
Backwards Compatibility over Latest Tech
• Regressed from Parquet 2.X to Spark-flavored Parquet 1.0
• Subsequent migrations needed to move away from backwards compatibility
Multiple Implementations over Unification
• No single implementation is optimal across all dataset and hardware variations
• Constantly balancing ROI of maintaining vs. discarding an implementation
Manual Memory Management over Automatic
• Difficult to predict object store memory requirements at cluster & task creation time
• High-efficiency self-managed GC preferred over fully managed GC (see the sketch after this slide)
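The manual memory management concession is easier to see in code. Below is a minimal, hypothetical sketch (not our production compactor) of the explicit controls Ray exposes: sizing the object store at startup, declaring per-task memory requirements, and dropping object references promptly so the store can evict them. All sizes and the workload are placeholders.

```python
# Sketch of manual memory controls used instead of relying purely on
# automatic management (illustrative; sizes and workload are hypothetical).
import numpy as np
import ray

# Size the object store explicitly at driver/cluster start (bytes).
ray.init(object_store_memory=4 * 1024**3)

# Declare per-task memory requirements so the scheduler packs tasks safely.
@ray.remote(memory=2 * 1024**3)
def merge_chunk(chunk: np.ndarray) -> float:
    return float(chunk.sum())

refs = [merge_chunk.remote(np.ones(10_000_000)) for _ in range(4)]
results = ray.get(refs)

# "Self-managed GC": drop references as soon as results are consumed so the
# object store can evict the underlying objects instead of waiting.
del refs
```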
Lessons Learned
Budgeting for Deferred Results
• Don’t save the hardest problems for last
• Know time to recoup initial investment
• Show you can recoup initial investment
Correctness
• Your customers define correctness
• Type systems rarely agree on equality
• Don’t expect 100% result equality
• Build key dataset equivalence checks
Regressive Datasets
• Test against production data ASAP
• Code coverage != edge-case coverage
• Tests can’t replace production telemetry
• At scale, QA and telemetry converge
Operational Sustainability
• Murphy’s Law is always true at scale
• EC2 excels at larger, >10TB jobs
• Glue excels at smaller, <10TB jobs
• Automate manual ops early and often
• Building serverless Ray infra is hard
Results
1. Ray vs. Spark Production Metrics
2. Key Production Statistics
3. Fun Facts
Key Production Statistics
Ray DeltaCAT Compactor
October 2024
Job Runs: >25K jobs/day
EC2 vCPUs Provisioned: >1.5MM vCPUs/day
Data Merged: >40 PB/day (pyarrow)
Cost Efficiency: $2.74/TB (on-demand r6g avg), $0.59/TB (spot r6g avg)
Projected Annual Savings: >220K EC2 vCPU Years, ~$100MM/year (on-demand r6g)
Fun Facts
Did you know…
1. All of our internal production Data Catalog API calls go from Python to Java?
2. We have a critical prod dependency on Py4j, and it hasn’t caused a single issue yet?
But we spent A LOT of time stabilizing this Py4j dependency before going to prod! (A minimal Py4j sketch follows below.)
3. Our production compactor code is checked into the open source DeltaCAT project?
And we’ve started testing it on Iceberg!
So if you find any issues or would like to help, please let us know!
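For context, this is roughly what Py4j usage looks like: a Python process calling into a JVM through a GatewayServer. It is the canonical Py4j pattern from its documentation, not our actual Data Catalog client code.

```python
# Minimal Py4j sketch (illustrative). Assumes a Py4j GatewayServer is already
# running inside the JVM on the default port.
from py4j.java_gateway import JavaGateway

gateway = JavaGateway()                  # connect to the JVM's GatewayServer
random = gateway.jvm.java.util.Random()  # instantiate a Java object from Python
print(random.nextInt(100))               # invoke a Java method from Python
```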
Conclusion
1. Spark to Ray?
2. Key Takeaways
Spark to Ray?
Should I start migrating my Spark jobs to Ray?
• Spark still provides more general and feature-rich data processing abstractions.
• There are no paved roads to translate all Spark applications to Ray-native equivalents.
• Don’t expect comparable improvements by just running your Spark jobs on RayDP.
Probably Not
(not yet, at least)
Key Takeaways
So What Do These Results Mean?
• The flexibility of Ray Core lets you craft optimal solutions to very specific problems.
• You may want to focus on rewriting your most expensive distributed compute jobs on Ray.
• Our compactor provides one specific example of how a targeted migration to Ray can pay off.
• The Exoshuffle paper shows a more general example of how Ray can improve data processing (and Ray on EC2 still holds the 100TB Cloud Sort cost record of $97!).
Ray has the potential to be a world-class big data processing
framework, but realizing that potential takes A LOT of work today!
Future Work
Apache Iceberg Compaction on DeltaCAT
https://github.com/ray-project/deltacat/
https://github.com/apache/iceberg
Apache Iceberg Compaction on DeltaCAT
→ What Will it Do?
+ Improve Iceberg table copy-on-write / merge-on-read efficiency & scalability.
+ Improve reading Iceberg equality deletes written by Apache Flink.
→ When Can I Use It?
+ Targeting a stable open source release in early 2025.
+ Currently running internal tests to verify correctness, stability, efficiency, etc.
→ What Can I Do Today?
+ Run local and distributed Iceberg table reads and writes via Daft on Ray.
+ https://www.getdaft.io/
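For example, a minimal Daft-on-Ray Iceberg read might look like the sketch below. The catalog and table names are hypothetical, and the exact Daft/PyIceberg APIs may vary by version.

```python
# Minimal sketch of reading an Iceberg table with Daft on Ray today
# (https://www.getdaft.io/). Catalog/table names and connection details
# are hypothetical.
import daft
from pyiceberg.catalog import load_catalog

daft.context.set_runner_ray()                    # run Daft queries on a Ray cluster

catalog = load_catalog("my_catalog")             # hypothetical PyIceberg catalog
iceberg_table = catalog.load_table("db.events")  # hypothetical table identifier

df = daft.read_iceberg(iceberg_table)            # distributed Iceberg scan
df.where(df["event_type"] == "click").show()
```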
Thank You
Ray Community Slack
@Patrick Ames
DeltaCAT Project Homepage
https://github.com/ray-project/deltacat
Read More on the AWS Open Source Blog
https://aws.amazon.com/blogs/opensource/
