Sharding: Past, Present and Future with Krutika Dhananjay

Sharding
Past, Present and Future
Krutika Dhananjay - kdhananj@redhat.com
Software Engineer

PAST
Alas! Sharding has no past!
Striping

What is striping?
● Client-side translator - sits below DHT
● Every striped file exists as <stripe-count> piece files, one per stripe subvolume
● Each file split into <block-size> chunks.
● Consecutive chunks are spread across multiple piece files
in a round-robin fashion.
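
The round-robin placement described above can be sketched in a few lines of Python. This is only an illustration of the mapping, not the stripe translator's code; the function names and the 1-based chunk numbering are assumptions chosen for the example.

```python
def stripe_subvol(chunk_number: int, stripe_count: int) -> int:
    """Return the 0-based stripe subvolume holding a 1-based chunk number."""
    if chunk_number < 1 or stripe_count < 1:
        raise ValueError("chunk_number and stripe_count must be >= 1")
    # Consecutive chunks go to consecutive subvolumes, wrapping around.
    return (chunk_number - 1) % stripe_count


def stripe_layout(num_chunks: int, stripe_count: int) -> dict:
    """Group chunk numbers 1..num_chunks by the subvolume they land on."""
    placement = {subvol: [] for subvol in range(stripe_count)}
    for chunk in range(1, num_chunks + 1):
        placement[stripe_subvol(chunk, stripe_count)].append(chunk)
    return placement


if __name__ == "__main__":
    # A 10-chunk file with stripe-count = 3.
    for subvol, chunks in stripe_layout(10, 3).items():
        print(f"stripe-subvol-{subvol}: {chunks}")
```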

Striping in Action
[Figure: a file of 10 chunks before and after striping, Stripe-count = 3.
Each <num> is 1 chunk of size <stripe-block-size>.]
Unstriped file: 1 2 3 4 5 6 7 8 9 10
After striping:
stripe-subvol-0: 1 4 7 10
stripe-subvol-1: 2 5 8
stripe-subvol-2: 3 6 9

Stripe translator - shortcomings
● Cost - you can add servers only in multiples of
‘stripe-count * replica-count’.
● File splitting not granular enough
○ Self-heal of a striped file still has to heal
‘total_file_size/stripe_count’ bytes of data.
○ Geo-replication of a striped file still has to sync
‘total_file_size/stripe_count’ bytes of data.

Stripe shortcomings contd ...
● Suboptimal utilization of disks
○ An ‘x’ TB sized file would still require at least
‘x/stripe-count’ amount of space available in any
subvolume of DHT
○ … in turn implies suboptimal distribution of IOPs
across bricks for a given file.

PRESENT - Sharding for
VM Image Storage

What is sharding?
● Client-side xlator – sits above DHT
● Splits file into equal-sized chunks as it grows in size
● Shards beyond first block kept in a hidden /.shard
directory and the first block under its parent dir
● Translators above shard only see the user files
● Translators below shard see shards as normal files
● Shard naming is <gfid>.<num>
● Shard size configurable at volume level - 4MB to 4TB
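
The layout above can be sketched as a small Python helper that maps a byte offset to the file that holds it. This is a hedged illustration, not the shard translator's code: the function names, the example path and the example gfid are hypothetical, and the sketch assumes block 0 is the base file and later blocks are numbered from 1 in the <gfid>.<num> name.

```python
SHARD_DIR = "/.shard"  # hidden directory named on the slide


def shard_index(offset: int, shard_size: int) -> int:
    """Return the 0-based block number that contains a byte offset."""
    if offset < 0 or shard_size <= 0:
        raise ValueError("offset must be >= 0 and shard_size > 0")
    return offset // shard_size


def shard_path(base_path: str, gfid: str, index: int) -> str:
    """Return where a block lives: the base file for block 0,
    /.shard/<gfid>.<num> for every later block (assumed numbering)."""
    if index == 0:
        return base_path
    return f"{SHARD_DIR}/{gfid}.{index}"


if __name__ == "__main__":
    MiB = 1024 ** 2
    gfid = "0f5a2c3e-1111-2222-3333-444455556666"  # made-up gfid for illustration
    for offset in (0, 5 * MiB, 13 * MiB):
        idx = shard_index(offset, 4 * MiB)
        print(offset, "->", shard_path("/vms/disk.img", gfid, idx))
```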

[Figure: client-side translator stack, top to bottom]
FUSE / gfapi / other protocol
io-stats
write-behind
shard
DHT
AFR-0  AFR-1
protocol/client-0  protocol/client-1  protocol/client-2  protocol/client-3

Short Demo
https://asciinema.org/a/brvrvh2fhhl7djlpboz74y4ll

How sharding benefits the use case
● Granularity of data heal is at shard level
● Minimal resource utilization by background processes
(self-heal, geo-rep, etc)
● VM image size no longer limited by the capacity of
individual brick(s)
● Better distribution of IOPs across bricks
● Geo-rep can now operate at shard level
● Add new bricks only after existing bricks’ space is fully
utilized
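
To make the heal-granularity point concrete, here is a rough Python comparison. The striped figure uses the ‘total_file_size/stripe_count’ estimate from the stripe shortcomings slide; the sharded figure is an upper bound that assumes every shard touched by a modified region is healed in full. The function names, the 100 GiB file and the 64 MiB shard size are illustrative assumptions, not measurements.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3


def striped_heal_bytes(total_file_size: int, stripe_count: int) -> int:
    """Bytes to heal for a striped file, per the slide's estimate."""
    return total_file_size // stripe_count


def sharded_heal_bytes(dirty_offset: int, dirty_length: int, shard_size: int) -> int:
    """Upper bound on bytes to heal when only one region of a sharded
    file changed: every shard overlapping the region, in full."""
    if dirty_length <= 0:
        return 0
    first = dirty_offset // shard_size
    last = (dirty_offset + dirty_length - 1) // shard_size
    return (last - first + 1) * shard_size


if __name__ == "__main__":
    # A 100 GiB image in which 1 MiB at offset 10 MiB was modified.
    print(striped_heal_bytes(100 * GiB, 4) // GiB, "GiB with stripe-count 4")
    print(sharded_heal_bytes(10 * MiB, 1 * MiB, 64 * MiB) // MiB, "MiB with 64 MiB shards")
```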

Where is the file metadata stored?
● File permissions, ownership, aggregated file size, block-count
and user-set extended attributes are maintained only on the
base file. Shards under “.shard” are owned by root.
● Because sharding is only used in single-writer use cases,
mtime is maintained on a best-effort basis in memory and
kept up-to-date as individual shards witness writes.
Moral of the story - lookup, stat, {get,set,remove}xattr are directly
served from the base file => 1 network call.

How does writing to a sharded file work?
● Create ‘.shard’ if it doesn’t exist.
● Identify participant shards, given write offset and length.
● Create shards if non-existent, in parallel.
● Send writes on participant shards at appropriate shard
offsets in parallel.
● Once all write responses are received, update size and
block count through an xattrop operation.
● Update in-memory cache containing the file size and
block-count and unwind the call.
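
The "identify participant shards" step above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the shard translator's implementation: the function name and the (block, offset-in-block, byte-count) tuple layout are hypothetical, and block 0 stands for the base file.

```python
def participant_shards(offset: int, length: int, shard_size: int) -> list:
    """Split a write of `length` bytes at `offset` into per-shard pieces.

    Returns a list of (block_number, offset_within_block, nbytes) tuples,
    one per shard that the write touches, in ascending block order.
    """
    if offset < 0 or shard_size <= 0:
        raise ValueError("offset must be >= 0 and shard_size > 0")
    pieces = []
    pos = offset
    end = offset + length
    while pos < end:
        block = pos // shard_size
        in_block = pos % shard_size
        # Write up to the end of this shard or the end of the request.
        nbytes = min(shard_size - in_block, end - pos)
        pieces.append((block, in_block, nbytes))
        pos += nbytes
    return pieces


if __name__ == "__main__":
    MiB = 1024 ** 2
    # A 6 MiB write at offset 3 MiB with 4 MiB shards touches blocks 0, 1 and 2.
    for piece in participant_shards(3 * MiB, 6 * MiB, 4 * MiB):
        print(piece)
```

Each tuple corresponds to one write that can be sent independently, which is what allows the per-shard writes to go out in parallel.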

How are renames and hard-links handled?
● Both fops operate only on the base file.
● File’s gfid remains constant even after a rename => shards
under ‘.shard’ don’t need to be renamed.
● In other words, renaming and hard-linking a sharded file
completes in one single (atomic) network call.

Interoperability with existing Gluster
features?
● Verified that it works fine with geo-replication, hence
supported
● It should “theoretically” work fine in its current state with
features such as bit-rot detection, tiering etc because of its
position in the stack
● Features that won’t readily work with sharding (at least
not without additional code changes) - quota, snapshots,
etc.

FUTURE - Sharding for
general purpose use cases

Main challenges
● Classic trade-off between consistency and performance
● For performance
○ Maximise parallelism across non-overlapping regions
of the large file
● For consistency
○ Keep writes atomic
○ Keep file size and block-count updates atomic and
accurate
○ mtime should reflect highest value
○ Handle truncates and appending writes correctly

The idea so far ...
● Do not try to solve fault tolerance and
recovery.
○ Use replication!
● Avoid locking as far as possible.
○ Do away with locks for writes that do not span
more than one shard
○ Use locking only for writes that modify multiple
shards to prevent interleaving of multiple
parallel writes
○ Introduce common locking framework to
minimize impact of locking by multiple
translators, on performance.
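
One way to read the locking rule above is as a simple predicate on the write's offset and length. The sketch below is a hypothetical Python illustration of that rule only; the function name is an assumption, and it says nothing about how the proposed locking framework would actually take or release locks.

```python
def write_needs_lock(offset: int, length: int, shard_size: int) -> bool:
    """Return True only if the write modifies more than one shard."""
    if offset < 0 or shard_size <= 0:
        raise ValueError("offset must be >= 0 and shard_size > 0")
    if length <= 0:
        return False
    first_block = offset // shard_size
    last_block = (offset + length - 1) // shard_size
    # A write confined to one shard can skip locking under this rule.
    return last_block != first_block


if __name__ == "__main__":
    MiB = 1024 ** 2
    print(write_needs_lock(0, 4 * MiB, 4 * MiB))        # fits in one shard
    print(write_needs_lock(3 * MiB, 2 * MiB, 4 * MiB))  # crosses a shard boundary
```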

The idea so far ...
● Introduce a server-side translator to
manage size update
○ Eliminate the need to take locks over the
network for size update.
● Store ctime/mtime in the form of an xattr
on the base file
○ This is a generic problem that needs to be
solved across multiple translators
● Possibly leverage compound fops
● Bitmaps for counting blocks?
