Incremental Iceberg Table Replication at Scale
Szehon Ho & Hongyue Zhang
June 9–12, 2025
Did You Know It’s Hard to Copy an Iceberg Table?
The Iceberg table spec chose absolute paths
• That’s great: there is no ambiguity about which files the table refers to
But some things become hard:
• Migrating a table
• Copying a table
• Backup and disaster recovery
Absolute Paths
All references in Iceberg:
• Catalogs
• Metadata.json (table version file)
• Manifest list (snapshot file)
• Manifest file
• Position delete files (V2)
• Delete vector Puffin files (V3)
Copying an Iceberg table to another location will break all these references.
Workaround #1 (CTAS): Create Table As Select Drawbacks

CREATE TABLE target AS
SELECT *
FROM source;

• Expensive (processes all rows)
• Loses all snapshots and history
Workaround #2 (Copy and Add Files): Copy Parquet Files Drawbacks
1. Find all of the source table’s current data files
2. DistCP all data files
3. Create the target Iceberg table
4. Use the add_files Spark procedure to add the copied Parquet files to the target table (see the sketch below)
Drawbacks:
• Delete files are broken (their file references are invalid)
• Broken schema and partition evolution
• Still loses all snapshots and history
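For illustration, a minimal sketch of steps 3–4 in Spark SQL using Iceberg's add_files procedure; the catalog, schema, and bucket names here are hypothetical:

-- Step 3: create an empty target Iceberg table (hypothetical schema).
CREATE TABLE demo.db.target (id BIGINT, data STRING) USING iceberg;

-- Step 4: register the already-copied Parquet files with the target table.
CALL demo.system.add_files(
  table => 'db.target',
  source_table => '`parquet`.`s3://target-bucket/db/target/data`'
);

Note that add_files only generates fresh metadata for the data files; it cannot carry over the source table's snapshots, delete files, or evolution history, which is exactly the drawback listed above.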
Workaround #3 (Multi-Region Access Point) Drawbacks
1. Create the Iceberg table with an MRAP location
2. Enable cross-region replication
Drawbacks:
• Only supported for AWS S3 buckets
• Failover to a secondary region requires user action
• Replication lag of up to 15 minutes
• Only fits a specific disaster-recovery use case
Table Replication Workflow: Source => Target in 3 Steps
1. Rewrite: rewrite all metadata files, replacing prefixes in absolute paths, and save them to a staging location
2. Copy: use an external copy tool (e.g., DistCP) to copy the list of files (staging metadata files + original data files) from source to target
3. Register: check referential integrity before registering the latest copied metadata file in the target catalog as a new table
RewriteTablePaths SparkAction
Args = (Source Location, Target Location, Staging Location)
1. Rewrites all metadata files with absolute paths, replacing source => target
2. Saves these to the staging location
Return values:
3. List of files to copy (staging metadata files + original data files)
4. Name of the latest metadata.json rewritten (will become the target table)
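Since Iceberg 1.8.0 this action is also exposed as the rewrite_table_paths Spark procedure (see Current State below). A minimal full-copy sketch; the bucket names are hypothetical, and the argument names follow the Iceberg docs, so verify them against your version:

CALL demo.system.rewrite_table_paths(
  table => 'db.sample',
  source_prefix => 's3://source-bucket/db/sample',    -- prefix to replace
  target_prefix => 's3://target-bucket/db/sample',    -- replacement prefix
  staging_location => 's3://staging-bucket/db/sample' -- where rewritten metadata is saved
);
-- Output: the name of the latest rewritten metadata.json and the location of
-- the list of files to copy.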
Typical Table
A commit adds:
• a new version (metadata.json)
• a new snapshot
• a new manifest
• a new data file
References point down (version => snapshots => manifests => data files) and back (version => previous versions in the metadata log).
[Diagram: source_table metadata tree. /source/metadata-1.json lists snapshot /source/snap-1.avro; /source/metadata-2.json lists /source/metadata-1.json in its metadata log plus snapshots /source/snap-1.avro and /source/snap-2.avro. snap-1.avro lists manifest m1.avro; snap-2.avro lists m1.avro and m2.avro. /source/manifest-1.avro points to data file /source/data-1.parquet; /source/manifest-2.avro points to /source/data-2.parquet.]
RewriteTablePaths (source_table): replace /source/… => /target/…
[Diagram: the same source_table metadata tree as above; every /source/… path in it will be rewritten to /target/….]
Rewrite metadata files to /staging (/source => /target); data files are not rewritten.
Returns:
• staged metadata file paths
• data file paths
[Diagram: staged files /staging/metadata-1.json, /staging/metadata-2.json, /staging/snap-1.avro, /staging/snap-2.avro, /staging/manifest-1.avro, and /staging/manifest-2.avro now reference /target/… paths internally; the data files /source/data-1.parquet and /source/data-2.parquet stay in place.]
DistCP:
• metadata files /staging => /target
• data files /source => /target
RegisterTable: register /target/metadata-2.json as target_table.
[Diagram: the resulting target_table metadata tree mirrors the source tree, with every path under /target/….]
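The copy and register steps might look like the following sketch: an external tool copies the listed files, then Iceberg's register_table procedure creates the target table from the copied metadata file. Paths and table names here are hypothetical:

-- First copy the files from the rewrite's file list with an external tool
-- (e.g., DistCP): staging metadata => /target, data files => /target.
-- Then register the latest copied metadata file as the new table:
CALL demo.system.register_table(
  table => 'db.target_table',
  metadata_file => 's3://target-bucket/db/sample/metadata/metadata-2.json'
);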
Problem: Incremental Copy
How to keep a table in sync if the source changes?
RewriteTablePaths (source_table) with start_version = metadata-1.json and end_version = metadata-2.json: identify the files added between the two versions.
[Diagram: the source_table metadata tree again; only the files introduced between metadata-1.json and metadata-2.json fall in the range.]
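An incremental run might look like this sketch, using the optional start_version and end_version arguments described in the notes; the version file names are hypothetical:

CALL demo.system.rewrite_table_paths(
  table => 'db.sample',
  source_prefix => 's3://source-bucket/db/sample',
  target_prefix => 's3://target-bucket/db/sample',
  staging_location => 's3://staging-bucket/db/sample',
  start_version => 'metadata-1.json',  -- copy only files added after this version...
  end_version => 'metadata-2.json'     -- ...up to and including this version
);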
Rewrite the metadata files in range (/source => /target); data files are NOT rewritten.
Returns:
• staged metadata file paths
• data file paths in range
[Diagram: only /staging/metadata-2.json, /staging/snap-2.avro, and /staging/manifest-2.avro are staged, referencing /target/… paths; /source/data-2.parquet is the only data file in range.]
Target Side (Existing Table)
[Diagram: target_table already holds /target/metadata-1.json => snapshot /target/snap-1.avro => manifest m1.avro => data file /target/data-1.parquet.]
DistCP:
• metadata files /staging => /target
• data files /source => /target
Register (target_table): historic references in the target are automatically re-established.
[Diagram: the final target_table tree — /target/metadata-2.json lists metadata-1.json in its metadata log and snapshots snap-1.avro and snap-2.avro, whose manifests point to /target/data-1.parquet and /target/data-2.parquet.]
How About Other Iceberg Features?
• Snapshot expiration?
• Write-audit-publish?
• Branches and tags?
• Rollback and reset?
Expire Snapshots
A common cleanup operation: it deletes files that have no references in the latest snapshots. Version files are left with invalid snapshots.
[Diagram: version timeline V0 () => Append => V1 (S1) => Append => V2 (S1, S2) => Expire => V3 (S2). After S1 is expired, only S2’s files remain, so V1 and V2 still reference files that no longer exist.]
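For context, this cleanup is typically run with Iceberg's expire_snapshots Spark procedure; a minimal sketch with a hypothetical table name and retention settings:

CALL demo.system.expire_snapshots(
  table => 'db.sample',
  older_than => TIMESTAMP '2025-01-01 00:00:00',  -- expire snapshots older than this
  retain_last => 1                                -- always keep at least one snapshot
);

Any version file written before the expiration may still reference the deleted snapshots, which is what trips up a naive copy.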
Error Check
RewriteTablePaths disallows invalid versions between the start and end version: all references must be valid. In the timeline above, V1 and V2 are disallowed.
[Diagram: the same timeline V0 () => Append => V1 (S1) => Append => V2 (S1, S2) => Expire => V3 (S2); V1 and V2 reference the expired snapshot S1 and are rejected.]
Branching, Rollback
These do not delete files, so no snapshots are invalidated (all versions may be copied here).
Problem: Concurrency
How to handle concurrent modification to the table?
1. Expire snapshots runs concurrently with RewriteTablePaths or DistCP
2. Files are deleted, causing failures
3. Solution: schedule these jobs non-concurrently
Problem: Initial Copy (Cold Start)
How to bootstrap the copy with petabytes worth of data? There is too much data in the initial copy (DistCP).
Solution:
1. Semantically equivalent to copy + expire
2. It is possible to filter the source metadata.json during rewrite, keeping all snapshots after X
3. Limit further rewrites to references from snapshots after X
4. But all future incremental copies must also apply the same filter (else we bring back snapshots unknown to the target table)
[Diagram: /source/metadata.json with snapshots [S1, S2, S3] => /target/metadata.json with snapshots [S3].]
Current State
1. The RewriteTablePaths Spark action and rewrite_table_paths procedure were released in Iceberg 1.8.0
2. The referential integrity check and registering a table with override metadata are WIP
Contribution Welcome!
1. Community interest in Hive-to-Iceberg migration with in-place upgrade: https://github.com/apache/iceberg/issues/12762
2. Community interest in supporting relative paths in Iceberg metadata V4:
Editor's Notes

  • #3 Explain a bit in detail
  • #4 Using a query engine, run CREATE TABLE AS SELECT *. However, this is slower, as all the rows need to be read, deserialized, serialized again, and written, potentially even shuffled. What we want is a file-level copy. You also lose all snapshots and history associated with the Iceberg metadata that is not copied.
  • #5 A better solution is to copy file by file. We use an external tool to copy all the files, then use the Iceberg 'add_files' procedure, which generates metadata for each file. However, not copying all the original metadata causes subtle problems. First, delete files and delete vectors are copied from the source, but their references still point to the source side, so they are broken. We also lose snapshots and history, as in the first solution.
  • #6 For disaster recovery, MRAP is an AWS solution that provides a wrapper around replicated buckets; behind the scenes, the same path can access data from different buckets. In a sense, it makes all file path references dynamic, so they work for both the source and the destination bucket. But there are a lot of issues with this. It is only supported for AWS S3, failing over requires user action, and replication lag means tables are not available synchronously. You may even have broken references as parts of snapshots are copied over, making the replicated Iceberg table not very atomic at all. It also only covers the backup/disaster-recovery case: we don't get a separate table on the target side.
  • #7 Table in source location => table in target location. Rewrite all metadata files, replacing prefixes in absolute paths, and save these to a staging location. Find the list of files to copy (staging metadata files + original data files). Run an external copy tool (i.e., DistCP) from source to target. Check integrity (all references are valid), then register the latest copied metadata file in the target catalog as a new table.
  • #13 RewriteTablePaths incremental mode (vs. full mode): the optional ‘start_version’ and ‘end_version’ arguments copy only the files added in this range (between these version files).
  • #22 RewriteTablePaths incremental mode (vs. full mode): the optional ‘start_version’ and ‘end_version’ arguments copy only the files added in this range (between these version files).
  • #24 RewriteTablePaths incremental mode (vs. full mode): the optional ‘start_version’ and ‘end_version’ arguments copy only the files added in this range (between these version files).
  • #26 https://iceberg.apache.org/docs/nightly/spark-procedures/#table-replication
  • #27 https://iceberg.apache.org/docs/nightly/spark-procedures/#table-replication