Incremental Iceberg Table Replication at Scale
Szehon Ho & Hongyue Zhang
June 9–12, 2025
Did You Know It’s Hard to Copy an Iceberg Table?
The Iceberg table spec chose absolute paths
• That’s great: there is no ambiguity about which files the table refers to
But some things become hard:
• Migrating a table
• Copying a table
• Backup and disaster recovery
Absolute Paths
All references in Iceberg:
• Catalogs
• Metadata.json (table version file)
• Manifest list (snapshot file)
• Manifest file
• Position delete files (V2)
• Delete vector Puffin files (V3)
Copying an Iceberg table to another location will break all these references.
Workaround #1 (CTAS): Create Table As Select Drawbacks

CREATE TABLE target AS
SELECT *
FROM source;

• Expensive (processes all rows)
• Loses all snapshots and history
Workaround #2 (Copy and Add Files): Copy Parquet Files Drawbacks
1. Find all of the source table’s current data files
2. DistCP all data files
3. Create the target Iceberg table
4. Use the add_files Spark procedure to add the copied Parquet files to the target table (see the sketch below)
Drawbacks:
• Delete files are broken (their file references are invalid)
• Broken schema and partition evolution
• Still loses all snapshots and history
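For illustration, a minimal sketch of steps 3–4 in Spark SQL using Iceberg's add_files procedure; the catalog, schema, and bucket names here are hypothetical:

-- Step 3: create an empty target Iceberg table (hypothetical schema).
CREATE TABLE demo.db.target (id BIGINT, data STRING) USING iceberg;

-- Step 4: register the already-copied Parquet files with the target table.
CALL demo.system.add_files(
  table => 'db.target',
  source_table => '`parquet`.`s3://target-bucket/db/target/data`'
);

Note that add_files only generates fresh metadata for the data files; it cannot carry over the source table's snapshots, delete files, or evolution history, which is exactly the drawback listed above.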
Workaround #3 (Multi-Region Access Point) Drawbacks
1. Create the Iceberg table with an MRAP location
2. Enable cross-region replication
Drawbacks:
• Only supported for AWS S3 buckets
• Failover to a secondary region requires user action
• Replication lag of up to 15 minutes
• Only fits a specific disaster-recovery use case
Table Replication Workflow: Source => Target in 3 Steps
1. Rewrite: rewrite all metadata files, replacing prefixes in absolute paths, and save them to a staging location
2. Copy: use an external copy tool (e.g., DistCP) to copy the list of files (staging metadata files + original data files) from source to target
3. Register: check referential integrity before registering the latest copied metadata file in the target catalog as a new table
RewriteTablePaths SparkAction
Args = (Source Location, Target Location, Staging Location)
1. Rewrites all metadata files with absolute paths, replacing source => target
2. Saves these to the staging location
Return values:
3. List of files to copy (staging metadata files + original data files)
4. Name of the latest metadata.json rewritten (will become the target table)
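Since Iceberg 1.8.0 this action is also exposed as the rewrite_table_paths Spark procedure (see Current State below). A minimal full-copy sketch; the bucket names are hypothetical, and the argument names follow the Iceberg docs, so verify them against your version:

CALL demo.system.rewrite_table_paths(
  table => 'db.sample',
  source_prefix => 's3://source-bucket/db/sample',    -- prefix to replace
  target_prefix => 's3://target-bucket/db/sample',    -- replacement prefix
  staging_location => 's3://staging-bucket/db/sample' -- where rewritten metadata is saved
);
-- Output: the name of the latest rewritten metadata.json and the location of
-- the list of files to copy.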
Typical Table
A commit adds:
• a new version (metadata.json)
• a new snapshot
• a new manifest
• a new data file
References point down (version => snapshots => manifests => data files) and back (version => previous versions in the metadata log).
[Diagram: source_table metadata tree. /source/metadata-1.json lists snapshot /source/snap-1.avro; /source/metadata-2.json lists /source/metadata-1.json in its metadata log plus snapshots /source/snap-1.avro and /source/snap-2.avro. snap-1.avro lists manifest m1.avro; snap-2.avro lists m1.avro and m2.avro. /source/manifest-1.avro points to data file /source/data-1.parquet; /source/manifest-2.avro points to /source/data-2.parquet.]
RewriteTablePaths (source_table): replace /source/… => /target/…
[Diagram: the same source_table metadata tree as above; every /source/… path in it will be rewritten to /target/….]
Rewrite metadata files to /staging (/source => /target); data files are not rewritten.
Returns:
• staged metadata file paths
• data file paths
[Diagram: staged files /staging/metadata-1.json, /staging/metadata-2.json, /staging/snap-1.avro, /staging/snap-2.avro, /staging/manifest-1.avro, and /staging/manifest-2.avro now reference /target/… paths internally; the data files /source/data-1.parquet and /source/data-2.parquet stay in place.]
DistCP:
• metadata files /staging => /target
• data files /source => /target
RegisterTable: register /target/metadata-2.json as target_table.
[Diagram: the resulting target_table metadata tree mirrors the source tree, with every path under /target/….]
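The copy and register steps might look like the following sketch: an external tool copies the listed files, then Iceberg's register_table procedure creates the target table from the copied metadata file. Paths and table names here are hypothetical:

-- First copy the files from the rewrite's file list with an external tool
-- (e.g., DistCP): staging metadata => /target, data files => /target.
-- Then register the latest copied metadata file as the new table:
CALL demo.system.register_table(
  table => 'db.target_table',
  metadata_file => 's3://target-bucket/db/sample/metadata/metadata-2.json'
);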
Problem: Incremental Copy
How to keep a table in sync if the source changes?
RewriteTablePaths (source_table) with start_version = metadata-1.json and end_version = metadata-2.json: identify the files added between the two versions.
[Diagram: the source_table metadata tree again; only the files introduced between metadata-1.json and metadata-2.json fall in the range.]
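An incremental run might look like this sketch, using the optional start_version and end_version arguments described in the notes; the version file names are hypothetical:

CALL demo.system.rewrite_table_paths(
  table => 'db.sample',
  source_prefix => 's3://source-bucket/db/sample',
  target_prefix => 's3://target-bucket/db/sample',
  staging_location => 's3://staging-bucket/db/sample',
  start_version => 'metadata-1.json',  -- copy only files added after this version...
  end_version => 'metadata-2.json'     -- ...up to and including this version
);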
Rewrite the metadata files in range (/source => /target); data files are NOT rewritten.
Returns:
• staged metadata file paths
• data file paths in range
[Diagram: only /staging/metadata-2.json, /staging/snap-2.avro, and /staging/manifest-2.avro are staged, referencing /target/… paths; /source/data-2.parquet is the only data file in range.]
Target Side (Existing Table)
[Diagram: target_table already holds /target/metadata-1.json => snapshot /target/snap-1.avro => manifest m1.avro => data file /target/data-1.parquet.]
DistCP:
• metadata files /staging => /target
• data files /source => /target
Register (target_table): historic references in the target are automatically re-established.
[Diagram: the final target_table tree — /target/metadata-2.json lists metadata-1.json in its metadata log and snapshots snap-1.avro and snap-2.avro, whose manifests point to /target/data-1.parquet and /target/data-2.parquet.]
How About Other Iceberg Features?
• Snapshot expiration?
• Write-audit-publish?
• Branches and tags?
• Rollback and reset?
Expire Snapshots
A common cleanup operation: it deletes files that have no references in the latest snapshots. Version files are left with invalid snapshots.
[Diagram: version timeline V0 () => Append => V1 (S1) => Append => V2 (S1, S2) => Expire => V3 (S2). After S1 is expired, only S2’s files remain, so V1 and V2 still reference files that no longer exist.]
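For context, this cleanup is typically run with Iceberg's expire_snapshots Spark procedure; a minimal sketch with a hypothetical table name and retention settings:

CALL demo.system.expire_snapshots(
  table => 'db.sample',
  older_than => TIMESTAMP '2025-01-01 00:00:00',  -- expire snapshots older than this
  retain_last => 1                                -- always keep at least one snapshot
);

Any version file written before the expiration may still reference the deleted snapshots, which is what trips up a naive copy.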
Error Check
RewriteTablePaths disallows invalid versions between the start and end version: all references must be valid. In the timeline above, V1 and V2 are disallowed.
[Diagram: the same timeline V0 () => Append => V1 (S1) => Append => V2 (S1, S2) => Expire => V3 (S2); V1 and V2 reference the expired snapshot S1 and are rejected.]
Branching, Rollback
These do not delete files, so no snapshots are invalidated (all versions may be copied here).
Problem: Concurrency
How to handle concurrent modification to the table?
1. Expire snapshots runs concurrently with RewriteTablePaths or DistCP
2. Files are deleted, causing failures
3. Solution: schedule these jobs non-concurrently
Problem: Initial Copy (Cold Start)
How to bootstrap the copy with petabytes worth of data? There is too much data in the initial copy (DistCP).
Solution:
1. Semantically equivalent to copy + expire
2. It is possible to filter the source metadata.json during rewrite, keeping all snapshots after X
3. Limit further rewrites to references from snapshots after X
4. But all future incremental copies must also apply the same filter (else we bring back snapshots unknown to the target table)
[Diagram: /source/metadata.json with snapshots [S1, S2, S3] => /target/metadata.json with snapshots [S3].]
Current State
1. The RewriteTablePaths Spark action and rewrite_table_paths procedure were released in Iceberg 1.8.0
2. The referential integrity check and registering a table with override metadata are WIP
Contribution Welcome!
1. Community interest in Hive-to-Iceberg migration with in-place upgrade: https://github.com/apache/iceberg/issues/12762
2. Community interest in supporting relative paths in Iceberg metadata V4:
Editor's Notes

  • #3 Explain a bit in detail
  • #4 Using a query engine, run CREATE TABLE AS SELECT *. However, this is slower, as all the rows need to be read, deserialized, serialized again, and written, potentially even shuffled. What we want is a file-level copy. You also lose all snapshots and history associated with the Iceberg metadata that is not copied.
  • #5 A better solution is to copy file by file. We use an external tool to copy all the files, then use the Iceberg 'add_files' procedure, which generates metadata for each file. However, not copying all the original metadata causes subtle problems. First, delete files and delete vectors are copied from the source, but their references still point to the source side, so they are broken. We also lose snapshots and history, as in the first solution.
  • #6 For disaster recovery, MRAP is an AWS solution that provides a wrapper around replicated buckets; behind the scenes, the same path can access data from different buckets. In a sense, it makes all file path references dynamic, so they work for both the source and the destination bucket. But there are a lot of issues with this. It is only supported for AWS S3, failing over requires user action, and replication lag means tables are not available synchronously. You may even have broken references as parts of snapshots are copied over, making the replicated Iceberg table not very atomic at all. It also only covers the backup/disaster-recovery case: we don't get a separate table on the target side.
  • #7 Table in source location => table in target location. Rewrite all metadata files, replacing prefixes in absolute paths, and save these to a staging location. Find the list of files to copy (staging metadata files + original data files). Run an external copy tool (i.e., DistCP) from source to target. Check integrity (all references are valid), then register the latest copied metadata file in the target catalog as a new table.
  • #13 RewriteTablePaths incremental mode (vs. full mode): the optional ‘start_version’ and ‘end_version’ arguments copy only the files added in this range (between these version files).
  • #22 RewriteTablePaths incremental mode (vs. full mode): the optional ‘start_version’ and ‘end_version’ arguments copy only the files added in this range (between these version files).
  • #24 RewriteTablePaths incremental mode (vs. full mode): the optional ‘start_version’ and ‘end_version’ arguments copy only the files added in this range (between these version files).
  • #26 https://iceberg.apache.org/docs/nightly/spark-procedures/#table-replication
  • #27 https://iceberg.apache.org/docs/nightly/spark-procedures/#table-replication