Original Goal

Provide greater availability and durability with geographically distinct replicas.

Multi-Region Replication
• Replicate objects to other Swift clusters.
• Allow a configurable number of remote replicas.
• Ideally, allow per-container configuration.

Problems
• Very complex to implement; the simpler feature I propose is already pretty complex.
• Swift currently only has a cluster-wide replica count.
• Tracking how many replicas are remote, and where, adds complexity.
• Per-container remote replica counts add complexity.

Complexity = More Time and More Bugs
New Goal

Provide greater availability and durability with geographically distinct replicas.

Simpler Container Synchronization
• Replicate objects to other Swift clusters.
• The remote replica count is not configurable; it is the number of replicas the remote cluster is already configured for.
• Per-container configuration allowed, but only "to where".

Benefits
• Much simpler (but still complex).
• Doesn't alter fundamental Swift internals.
• Per-container configuration that doesn't change behavior, only the destination.
• Side benefit: can also synchronize containers within the same cluster (migrating an account to another, for instance).

Simpler = Less Time and Fewer Bugs
How the User Would Use It

1. Set the first container's X-Container-Sync-To and X-Container-Sync-Key values; the To is the second container's URL and the Key is made up:

   $ st post -t https://cluster2/v1/AUTH_gholt/container2 -k secret container1

2. Set the second container's X-Container-Sync-To and X-Container-Sync-Key values; the To is the first container's URL and the Key is the same made-up value:

   $ st post -t https://cluster1/v1/AUTH_gholt/container1 -k secret container2

Now any existing objects in the containers will be synced to one another, as well as any additional objects.
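Under the hood, those `st post` commands are just container POSTs carrying two sync headers. A minimal sketch in Python (the helper function here is hypothetical, not part of the real client; only the header names, URLs, and key mirror the example above):

```python
def sync_headers(sync_to, sync_key, auth_token):
    """Build the headers a container POST carries to enable sync.

    sync_to:  URL of the peer container (the X-Container-Sync-To value)
    sync_key: shared made-up secret both containers must agree on
    """
    return {
        "X-Auth-Token": auth_token,
        "X-Container-Sync-To": sync_to,
        "X-Container-Sync-Key": sync_key,
    }

# Mirroring step 1: point container1 at container2 with a shared key.
headers = sync_headers(
    "https://cluster2/v1/AUTH_gholt/container2", "secret", "<token>")
```

Step 2 is the same call with the first container's URL and the same key, which is what makes the pairing bidirectional.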
Advanced Container Synchronization

You can synchronize more than just two containers. Normally you synchronize a pair:

Container 1 <-> Container 2

But you could synchronize more by using a chain:

Container 1 -> Container 2 -> Container 3
Caveats

• Valid X-Container-Sync-To destinations must be configured for each cluster ahead of time. The feature is based on cluster trust.
• The Swift cluster clocks need to be set reasonably close to one another. Swift timestamps each operation, and these timestamps are used in conflict resolution. For example, if an object is deleted on one cluster and overwritten on the other, whichever has the newest timestamp will win.
• There needs to be enough bandwidth between the clusters to keep up with all the changes to the synchronized containers.
• There will be a burst of bandwidth used when turning the feature on for an existing container full of objects.
• A user has no explicit guarantee of when a change will make it to the remote cluster. For example, a successful PUT means that cluster has the object, not the remote cluster. The synchronization happens in the background.
• Does not sync object POSTs yet (more on this later).
• Since background syncs come from the container servers themselves, they need to communicate with the remote cluster, probably requiring an HTTP proxy, and probably one per zone to avoid choke points.
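The timestamp-based conflict resolution in the second caveat can be sketched in a few lines (this is an illustration of the rule, not Swift's actual code): each operation carries a timestamp, and the newest one wins regardless of arrival order.

```python
def resolve(ops):
    """Pick the winning operation from (timestamp, op) pairs: newest wins."""
    return max(ops, key=lambda pair: pair[0])

# Object deleted on cluster A at t=100.5, overwritten on cluster B at t=100.7:
winner = resolve([(100.5, "DELETE"), (100.7, "PUT")])
assert winner == (100.7, "PUT")  # the newer PUT wins, so the object survives
```

This is why the clocks need to be reasonably close: a badly skewed clock would make one cluster's operations "newer" than they really are.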
What's Left To Do?

• HTTP Proxying
• Tests
• Documentation
• POSTs

Because object POSTs don't currently cause a container database update, we need to either cause an update or come up with another way to synchronize them. The current plan is to modify POSTs to actually be a COPY internally. Downside: POSTs to large files will take longer. Upside: we have noticed very few POSTs in production.
Live Account Migrations

This is a big step towards live account migrations:

1. Turn on sync for the linked accounts on the two clusters.
2. Wait for the new account to get caught up.
3. Switch the auth response URL to the new account and revoke all existing account tokens.
4. Put the old account in a read-only mode.
5. Turn off sync from the new account to the old.
6. Wait until the old account is no longer sending updates, plus some safety time.
7. Purge the old account.

Missing Pieces:
• Account sync (creating new containers on both sides; deletes and POSTs too).
• Account read-only mode.
• Using alternate operator-only headers to not conflict with the user's, also keeping the user from seeing or modifying the values.
Implementation

st
• Updated to set/read X-Container-Sync-To and X-Container-Sync-Key.

Swauth and container-server
• Require a new conf value, allowed_sync_hosts, indicating the allowed remote clusters.

swift-container-sync
• New daemon that runs on every container server.
• Scans every container database looking for ones with sync turned on.
• Sends updates based on any new ROWIDs in the container database.
• Keeps sync points in the local container databases recording the last ROWIDs sent out.
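A hypothetical skeleton of one swift-container-sync pass, showing the scan/send/record cycle the bullets describe (the data layout and function names here are made up for illustration; the real daemon works against SQLite container databases):

```python
def container_sync_pass(databases, send_update):
    """One pass: for each DB with sync turned on, send rows past its sync point."""
    for db in databases:
        if not db.get("sync_to"):           # sync not turned on for this container
            continue
        point = db.get("sync_point", -1)    # last ROWID already sent out
        for rowid, obj in db["rows"]:
            if rowid > point:
                send_update(db["sync_to"], obj)
                point = rowid
        db["sync_point"] = point            # remember how far we got

# Toy run: one container with sync on, two new rows.
sent = []
dbs = [{"sync_to": "https://cluster2/v1/AUTH_gholt/container2",
        "sync_point": -1,
        "rows": [(1, "obj1"), (2, "obj2")]}]
container_sync_pass(dbs, lambda to, obj: sent.append(obj))
assert sent == ["obj1", "obj2"] and dbs[0]["sync_point"] == 2
```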
Complexity - swift-container-sync

There are three container databases on different servers for each container. There is no need, and it is quite wasteful, for each to send all the updates. The easiest solution is to have just one send out the updates, but:
• What if that one is down?
• Couldn't synchronization be done faster if all three were involved?

Instead, each sends a different third of the updates (assuming 3 replicas here).
• Downside: if one is down, a third of the updates will be delayed until it comes back up.

So, in addition, each node will send all older updates to ensure quicker synchronization.
• Normally, each server does a third of the updates.
• Each server also does all older updates for assurance.
• The vast majority of assurance updates will short-circuit.
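One simple way to split the work a different third per node (a sketch of the idea, not necessarily the exact partitioning Swift uses) is ROWID modulo the replica count:

```python
def rows_for_node(rowids, node_index, replica_count=3):
    """ROWIDs this node is primarily responsible for sending."""
    return [r for r in rowids if r % replica_count == node_index]

rows = [1, 2, 3, 4, 5, 6]                  # six new rows
thirds = [rows_for_node(rows, i) for i in range(3)]
assert thirds == [[3, 6], [1, 4], [2, 5]]  # disjoint thirds covering all rows
```

Each row lands with exactly one primary node, so under normal operation no update is sent twice; the assurance pass over older rows is what covers a node being down.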
In The Weeds

• Two sync points are kept per container database.
• All rows between the two sync points trigger updates.*
• Any rows newer than both sync points cause updates depending on the node's position for the container (primary nodes do one third, etc., depending on the replica count, of course).
• After a sync run, the first sync point is set to the newest ROWID known and the second sync point is set to the newest ROWID for which all updates have been sent.

* This is a slight lie. It actually only needs to send the two-thirds of updates it isn't primarily responsible for, since it knows it already sent the other third.
In The Weeds

An example may help. Assume the replica count is 3 and perfectly matching ROWIDs starting at 1.

First sync run, database has 6 rows:
• SyncPoint1 starts as -1.
• SyncPoint2 starts as -1.
• No rows between the points, so no "all updates" rows.
• Six rows are newer than SyncPoint1, so a third of the rows are sent by node 1, another third by node 2, and the remaining third by node 3.
• SyncPoint1 is set to 6 (the newest ROWID known).
• SyncPoint2 is left as -1, since no "all updates" rows were synced.
In The Weeds

Next sync run, database has 12 rows:
• SyncPoint1 starts as 6.
• SyncPoint2 starts as -1.
• The rows between -1 and 6 all trigger updates (most of which should short-circuit on the remote end as having already been done).
• Six more rows are newer than SyncPoint1, so a third of the rows are sent by node 1, another third by node 2, and the remaining third by node 3.
• SyncPoint1 is set to 12 (the newest ROWID known).
• SyncPoint2 is set to 6 (the newest "all updates" ROWID).

In this way, under normal circumstances, each node sends its share of updates each run and also sends a batch of older updates to ensure nothing was missed.
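The two-sync-point bookkeeping can be simulated with the same numbers as the example (hypothetical code, illustrating the scheme rather than the real implementation):

```python
def plan_sync_run(rows, sync_point1, sync_point2, node_index, replicas=3):
    """Plan one sync run over sorted ROWIDs.

    Returns (assurance_rows, share_rows, new_sp1, new_sp2):
    - assurance_rows: rows between the two sync points, sent by every node
    - share_rows: this node's third of the rows newer than SyncPoint1
      (splitting by ROWID modulo replica count is an assumed scheme)
    """
    assurance = [r for r in rows if sync_point2 < r <= sync_point1]
    newer = [r for r in rows if r > sync_point1]
    share = [r for r in newer if r % replicas == node_index]
    new_sp1 = max(rows)     # newest ROWID known
    new_sp2 = sync_point1   # everything up to the old SyncPoint1 is now "all sent"
    return assurance, share, new_sp1, new_sp2

# First run: 6 rows, both points at -1 -> no assurance rows, points become (6, -1).
a, s, sp1, sp2 = plan_sync_run(list(range(1, 7)), -1, -1, node_index=0)
assert a == [] and sp1 == 6 and sp2 == -1

# Second run: 12 rows, points (6, -1) -> rows 1..6 are assurance, points become (12, 6).
a, s, sp1, sp2 = plan_sync_run(list(range(1, 13)), 6, -1, node_index=0)
assert a == list(range(1, 7)) and sp1 == 12 and sp2 == 6
```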
Extras

• swift-container-sync can be configured to spend only x amount of time trying to sync a given container -- avoids one crazy container starving out all the others.
• A crash of a container server means lost container database copies that will be replaced by one of the remaining copies on the other servers. The reestablished server will get the sync points from the copy, but no updates will be lost, due to the "all updates" algorithm the other two followed.
• Rebalancing the container ring moves container database copies around, but results in the same behavior as a crashed server would.
• For bidirectional sync setups, the receiver will send the sender back the updates (though they will short-circuit). The only way I can think of to prevent that is to track where updates were received from (X-Loop), but that's expensive.

Anything Else?

email@example.com
http://tlohg.com/