Slide smallfiles
1. Issue
● 10-20 million objects per device
– 50 million inodes per device
● 36 devices per server
● 64 GB of RAM
– 1 inode is 1KB in RAM
– Would need 1.75TB of RAM for caching all inodes
● 75 % cache miss on inodes
– Up to 50 % of IO to get inodes from device
– (replicator/reconstructor constantly scan device...)
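The RAM figure above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes 1 KB means 1000 bytes and uses the 50 million inodes per device figure; the constant and function names are illustrative.

```python
# Back-of-the-envelope estimate of the RAM needed to cache every inode.
INODES_PER_DEVICE = 50_000_000   # figure from the slide
DEVICES_PER_SERVER = 36
INODE_RAM_BYTES = 1_000          # "1 inode is 1KB in RAM" (assuming 1 KB = 1000 bytes)
SERVER_RAM_BYTES = 64 * 10**9    # 64 GB of RAM


def inode_cache_bytes(inodes_per_device, devices, bytes_per_inode):
    """Total RAM needed to keep every inode of every device cached."""
    return inodes_per_device * devices * bytes_per_inode


needed = inode_cache_bytes(INODES_PER_DEVICE, DEVICES_PER_SERVER, INODE_RAM_BYTES)
print(f"RAM needed for all inodes: {needed / 10**12:.2f} TB")
print(f"share of inodes that 64 GB could hold: {SERVER_RAM_BYTES / needed:.1%}")
```

Under these assumptions the result is about 1.8 TB, the same order as the 1.75TB quoted above, and even a server that dedicated all of its RAM to inodes could hold only a few percent of them.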
2. Solution
● Get rid of inodes
● Haystack-like solution
– Objects in volumes (a.k.a. big files, 5GB or 10GB)
– K/V store to map object to (volume id, position)
● K/V is a gRPC service
● Backed by LevelDB (for now...)
● Need to avoid compaction issues
– fallocate(PUNCH_HOLE)
– Smart selection of volumes
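As an illustration of the fallocate(PUNCH_HOLE) item, here is a minimal sketch of punching a hole from Python through ctypes. It assumes 64-bit Linux (64-bit off_t) and a filesystem that supports hole punching; the helper name punch_hole is hypothetical, and the flag values are the ones defined in <linux/falloc.h>.

```python
import ctypes
import os

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def punch_hole(fd, offset, length):
    """Deallocate [offset, offset + length) without changing the file size.

    Raises OSError if the kernel or filesystem refuses the request, and
    AttributeError on platforms whose libc has no fallocate().
    """
    libc = ctypes.CDLL(None, use_errno=True)
    fallocate = libc.fallocate
    # Assumption: off_t is 64 bits wide (true on 64-bit Linux).
    fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                          ctypes.c_int64, ctypes.c_int64]
    fallocate.restype = ctypes.c_int
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

After a successful call the punched range reads back as zeros and its blocks are returned to the filesystem, so a deleted object's space can be reclaimed without rewriting the volume.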
3. Benefits
● 42 bytes per object in K/V
– Compared to 1KB for an XFS inode
– Fit in memory (20GB vs 1.75TB)
– Should easily go down to 30 bytes per object
● Listdir happens in K/V (so in memory)
● Space efficient compared with block-aligned files (!)
● Flat namespace for objects
– No part/sfx/ohash
– Increasing part power is just a ring thing
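The memory comparison above can be bracketed with simple arithmetic. The sketch below assumes 36 devices per server and 10 to 20 million objects per device, as stated in the Issue slide, and takes 1 KB as 1000 bytes; the names are illustrative.

```python
DEVICES_PER_SERVER = 36
KV_BYTES_PER_OBJECT = 42
XFS_INODE_BYTES = 1_000  # assuming 1 KB = 1000 bytes


def kv_index_bytes(objects_per_device, bytes_per_entry=KV_BYTES_PER_OBJECT):
    """Bytes of K/V entries needed to index every object on one server."""
    return objects_per_device * DEVICES_PER_SERVER * bytes_per_entry


for objects in (10_000_000, 20_000_000):
    gb = kv_index_bytes(objects) / 10**9
    print(f"{objects // 10**6}M objects/device: {gb:.1f} GB of K/V entries per server")

ratio = XFS_INODE_BYTES / KV_BYTES_PER_OBJECT
print(f"one inode costs about {ratio:.0f}x one K/V entry")
```

With these inputs the index lands between roughly 15 GB and 30 GB per server, which is consistent with the 20GB figure and fits within the 64 GB of RAM.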
4. Adding an object
1. Select a volume
2. Append object data
   1. Object header (magic string, ohash, size, …)
   2. Object metadata
   3. Object data
3. fdatasync() volume
4. Insert new entry in K/V (no transaction)
● <o><policy><ohash><filename> => <volume id><offset>
=> After a crash, the volume acts as a journal that can be replayed
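The write path and the crash replay can be sketched as follows. This is not the actual on-disk format: the header layout, the MAGIC value, storing the filename in the record, and the plain dict standing in for the gRPC/LevelDB K/V service are all assumptions made for illustration.

```python
import os
import struct

MAGIC = b"SKETCH01"                  # hypothetical magic string
# Hypothetical header: magic, ohash, filename length, metadata length, data length.
HEADER = struct.Struct("<8s32sHII")
_sync = getattr(os, "fdatasync", os.fsync)   # fall back where fdatasync is missing


def kv_key(policy, ohash, filename):
    # Mirrors <o><policy><ohash><filename> from the slide.
    return f"o{policy}{ohash}{filename}"


def add_object(volume, volume_id, kv, policy, ohash, filename, metadata, data):
    """Append header + metadata + data, sync, then index. Returns the offset."""
    offset = volume.seek(0, os.SEEK_END)
    name = filename.encode()
    header = HEADER.pack(MAGIC, ohash.encode(), len(name), len(metadata), len(data))
    volume.write(header + name + metadata + data)
    volume.flush()
    _sync(volume.fileno())           # record is durable before the K/V insert
    kv[kv_key(policy, ohash, filename)] = (volume_id, offset)
    return offset


def replay(volume, volume_id, policy):
    """Rebuild the K/V entries of one volume by scanning its records."""
    kv = {}
    size = os.fstat(volume.fileno()).st_size
    volume.seek(0)
    while True:
        offset = volume.tell()
        raw = volume.read(HEADER.size)
        if len(raw) < HEADER.size:
            break                    # end of volume or truncated header
        magic, ohash, name_len, meta_len, data_len = HEADER.unpack(raw)
        if magic != MAGIC:
            break                    # not a record written by add_object
        filename = volume.read(name_len).decode()
        end = volume.tell() + meta_len + data_len
        if end > size:
            break                    # record cut short by a crash
        volume.seek(end)
        kv[kv_key(policy, ohash.decode(), filename)] = (volume_id, offset)
    return kv
```

Because the record is synced before the K/V insert, a crash between the two steps loses only the index entry, and a scan of the volume recovers it; that ordering is why no K/V transaction is required.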
5. Removing an object
1. Select a volume
2. Insert a tombstone
3. fdatasync() volume
4. Insert tombstone in K/V
5. Run cleanup_ondisk_files()
   1. Punch a hole over the object
   2. Remove the old entry from K/V
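The removal steps can be sketched on the K/V side as shown below. This is a much simplified stand-in for the cleanup logic, not the real cleanup_ondisk_files(): the dict-based K/V, the (volume id, offset, length) value layout, and the two callbacks are hypothetical.

```python
def newest_first(filenames):
    """Sort names such as '1500000001.00000.ts' by timestamp, newest first."""
    return sorted(filenames, key=lambda n: float(n.rsplit(".", 1)[0]), reverse=True)


def delete_object(kv, ohash, timestamp, append_record, punch_hole):
    """Delete an object; kv maps (ohash, filename) -> (volume_id, offset, length).

    append_record(ohash, filename) covers steps 1 to 3 (select a volume,
    append the tombstone, sync) and returns the tombstone's location.
    punch_hole(volume_id, offset, length) releases space in a volume.
    """
    tombstone = f"{timestamp}.ts"
    location = append_record(ohash, tombstone)       # steps 1 to 3
    kv[(ohash, tombstone)] = location                # step 4
    # Step 5: everything older than the newest file for this ohash goes away.
    names = newest_first(name for (h, name) in kv if h == ohash)
    for name in names[1:]:
        volume_id, offset, length = kv[(ohash, name)]
        punch_hole(volume_id, offset, length)        # step 5.1
        del kv[(ohash, name)]                        # step 5.2
```

For example, deleting an object that has a single .data entry leaves only the new .ts entry in the K/V, and the .data range is handed to punch_hole exactly once.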
6. Volume selection
● Avoid holes in volumes to reduce compaction
– Try to group objects by partition
● => rebalance is compaction
– Put short-lived objects in dedicated volumes
● tombstones
● objects whose x-delete-at is soon
– Dedicated volumes for handoff?
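One possible reading of this selection policy, as a hedged sketch: group writes by partition, and route tombstones and soon-to-expire objects to separate volumes. The threshold SHORT_LIFE_SECONDS, the function names, and the dict of open volumes are all hypothetical; handoff handling is left out because the slide leaves it as an open question.

```python
import time

SHORT_LIFE_SECONDS = 24 * 3600   # hypothetical threshold for "x-delete-at soon"


def volume_class(filename, metadata, now=None):
    """Return 'short' for tombstones and soon-to-expire objects, else 'regular'."""
    now = time.time() if now is None else now
    if filename.endswith(".ts"):
        return "short"
    delete_at = metadata.get("X-Delete-At")
    if delete_at is not None and int(delete_at) - now < SHORT_LIFE_SECONDS:
        return "short"
    return "regular"


def select_volume(open_volumes, partition, filename, metadata, now=None):
    """Pick a volume id; open_volumes maps (partition, class) -> volume id."""
    key = (partition, volume_class(filename, metadata, now))
    if key not in open_volumes:
        # Stand-in for creating a new volume file on disk.
        open_volumes[key] = len(open_volumes) + 1
    return open_volumes[key]
```

Keeping one partition per volume means that when a partition moves during a rebalance, whole volumes empty out together, which is the sense in which rebalance doubles as compaction.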
7. Benchmarks
● Atom C2750 2.40GHz
● 16GB RAM
● HGST HUS726040ALA610 (4TB)
● Directly connecting to object servers
8. Benchmarks
● Single-threaded PUT (100-byte objects)
– From 0 to 4 million objects
● XFS : 19.8/s
● Volumes : 26.2/s
– From 4 million to 8 million objects
● XFS : 17/s
● Volumes : 39.2/s (because no new volumes are being created?)
● What we see (need numbers!)
– XFS : memory is full ; Volumes : memory is free
– Disk is busier with XFS
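To put the PUT rates above side by side, the ratio of the two backends can be computed directly. The rates are copied from the slide; the dictionary layout and function name are illustrative.

```python
# PUT rates (objects per second) copied from the benchmark slide.
RATES = {
    "0 to 4 million objects": {"xfs": 19.8, "volumes": 26.2},
    "4 to 8 million objects": {"xfs": 17.0, "volumes": 39.2},
}


def speedup(rates):
    """Ratio of the volume-based PUT rate to the XFS PUT rate."""
    return rates["volumes"] / rates["xfs"]


for phase, rates in RATES.items():
    print(f"{phase}: volumes reach {speedup(rates):.2f}x the XFS rate")
```

The ratio grows from about 1.3x in the first phase to about 2.3x in the second, as the XFS rate drops while the volume rate rises.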