OpenStack Swift - Lots of small files

Object storage optimization in Swift
Alexandre LECUYER
DevOps / irc: alecuyer
Romain LE DISEZ
DevOps / irc: rledisez

What’s the problem?
• Performance is bad
• Disks 100% busy
• Replication/reconstruction is very (very) slow

Replica in Swift
/srv/node/<device>/objects/<partition>/<suffix>/<hash>/<timestamp>.data
(Diagram: object 012345 is stored as three full copies.)

Erasure Coding in Swift
/srv/node/<device>/objects-1/<partition>/<suffix>/<hash>/<timestamp>#<fragment>#d.data
(Diagram: object 012345 is split into fragments 03, 14 and 25, plus parity fragment a9.)

Comparison
• Replica:
– Good performance
– High storage overhead
– 3 files per object (3 replicas)
• Erasure coding:
– Cost effective
– Slow-ish
– 15 files per object (12+3 fragments)

Where inodes join the party…
• XFS:
– one inode per file
– one inode per directory
• Inode:
– ctime/mtime/atime
– owner/group
– Permissions

Bad things happen
• One inode takes 300 bytes to 1k of memory
• Average: 2.4 inodes per fragment
– Data file: 1
– Object directory: 1
– Suffix directory + Partition directory: 0.4
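As a rough worked example of what these figures imply, the sketch below estimates the memory needed to keep every inode in cache. The 2.4 inodes per fragment and the 300 bytes to 1k range come from the slide ("1k" is assumed to mean 1024 bytes); the fragment count in the usage line is a hypothetical figure, not a number from the talk.

```python
INODES_PER_FRAGMENT = 2.4   # average from the slide
INODE_BYTES_LOW = 300       # lower bound from the slide
INODE_BYTES_HIGH = 1024     # "1k", assumed to mean 1024 bytes


def inode_cache_bytes(fragments):
    """Return (low, high) bytes needed to cache the inodes of `fragments` files."""
    inodes = round(fragments * INODES_PER_FRAGMENT)
    return inodes * INODE_BYTES_LOW, inodes * INODE_BYTES_HIGH


# Hypothetical server holding 100 million fragments.
low, high = inode_cache_bytes(100_000_000)
print(f"{low / 1e9:.1f} GB to {high / 1e9:.1f} GB of inode cache")
```

Under these assumptions the requirement quickly exceeds the RAM of a typical storage node, which is consistent with the cache misses described on the next slide.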

Memory issues
• Inodes cannot fit in cache anymore
– But every inode of the path must be checked to open a data file
• Only top-level directories are cached
– Only a 20% hit rate on the inode cache
– Up to 50% of device activity is spent reading inodes

Stability issues
• More filesystem corruptions
• Inability to run xfs_repair
– 1K of memory per inode
• Need dedicated servers just to repair filesystems
– About 48 hours to repair one filesystem

Let’s fix it!
(a.k.a. inodes are useless, right?)

We tried crazy things
• Storing objects in a K/V (RocksDB, LevelDB, …)
– Not suited to synchronous IO. Write amplification.
• Storing the file handles of data files in a K/V
– Atomicity is hard to guarantee across two separate data structures
• Patching XFS to drop useless information
– It’s already well optimized, inodes may be compressed
• Storing in ZFS DMU
– Lots of very cool features, but performance issues if full, low-level development

Store multiple objects in large files
• Volume layout: Volume Header, Object Header, Object Data, Object Header, Object Data

• Each volume is dedicated to a partition
• No concurrent writes
• Append only
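A minimal sketch of such an append-only volume follows. The header formats (`VOLUME_MAGIC`, `OBJECT_MAGIC`, a length field) and the function names are hypothetical and chosen only for illustration; the slides do not describe the actual on-disk encoding.

```python
import os
import struct
import tempfile

VOLUME_MAGIC = b"VOL1"                   # hypothetical volume header
OBJECT_HEADER = struct.Struct(">4sQ")    # hypothetical: magic + data length
OBJECT_MAGIC = b"OBJ1"


def append_object(volume_path, data):
    """Append an object header and its data; return the header offset."""
    with open(volume_path, "ab") as volume:
        offset = volume.seek(0, os.SEEK_END)
        if offset == 0:
            # New volume: the volume header comes first.
            volume.write(VOLUME_MAGIC)
            offset = len(VOLUME_MAGIC)
        volume.write(OBJECT_HEADER.pack(OBJECT_MAGIC, len(data)))
        volume.write(data)
    return offset


def read_object(volume_path, offset):
    """Read the object whose header starts at `offset`."""
    with open(volume_path, "rb") as volume:
        volume.seek(offset)
        magic, length = OBJECT_HEADER.unpack(volume.read(OBJECT_HEADER.size))
        if magic != OBJECT_MAGIC:
            raise ValueError("no object header at offset %d" % offset)
        return volume.read(length)


volume_path = os.path.join(tempfile.mkdtemp(), "volume_0001")
first = append_object(volume_path, b"small object")
second = append_object(volume_path, b"another small object")
print(first, second, read_object(volume_path, second))
```

Because objects are only ever appended, an object is fully identified by its volume and the offset of its header, which is what the index needs to store.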

Swift request path
• PUT / GET requests -> Proxy servers -> Object servers

How does Swift organize data?
• PUT: "photo.jpg" -> MD5 hash: bc6a624f493bf3042662064285f355c4
• Partition: bc6a -> 48234
• Suffix: 5c4
• Timestamp: 1449519086.42102.data
• /srv/node/sda/objects/48234/5c4/bc6a624f493bf3042662064285f355c4/1449519086.42102.data
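The mapping above can be sketched in a few lines. This is an illustration only: the hash is taken as given (how Swift derives it from the object name is not covered here), and the partition power of 16 is an assumption inferred from "bc6a -> 48234" (0xbc6a = 48234), not a value stated on the slide.

```python
import posixpath

PART_POWER = 16  # assumption: inferred from "bc6a -> 48234"


def object_path(device, obj_hash, timestamp, part_power=PART_POWER):
    """Build the on-disk path of a replicated object from its MD5 hex hash."""
    # The partition is the top `part_power` bits of the hash.
    partition = int(obj_hash[:8], 16) >> (32 - part_power)
    # The suffix is the last three hex characters of the hash.
    suffix = obj_hash[-3:]
    return posixpath.join("/srv/node", device, "objects", str(partition),
                          suffix, obj_hash, timestamp + ".data")


path = object_path("sda", "bc6a624f493bf3042662064285f355c4",
                   "1449519086.42102")
print(path)
```

Each path component (partition, suffix, hash) is a directory, so a single object costs several inodes on top of the data file itself.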

Example: writing an object (PUT)
• Components: Proxy server, Object server, Index server, Volumes
• Obtain a write lock on a volume (fcntl)
• Write the object at the end of the volume
• Register the object with the index server
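The write path can be sketched as follows. This is a simplified illustration, not the project's code: a plain dict stands in for the index server, the function name `put_object` is hypothetical, and object headers are omitted. `fcntl.lockf` is used for the lock since the slide mentions fcntl.

```python
import fcntl
import os
import tempfile


def put_object(volume_paths, index, key, data):
    """Append `data` to the first volume that can be locked, then register it."""
    for volume_index, path in enumerate(volume_paths):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            try:
                # Non-blocking exclusive lock: skip volumes held by another writer.
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                continue
            offset = os.lseek(fd, 0, os.SEEK_END)
            os.write(fd, data)
            os.fdatasync(fd)  # make the data durable before registering it
            index[key] = (volume_index, offset)
            return volume_index, offset
        finally:
            os.close(fd)  # closing the descriptor also releases the lock
    raise RuntimeError("no volume available for writing")


directory = tempfile.mkdtemp()
volumes = [os.path.join(directory, "volume_%d" % i) for i in range(3)]
index = {}
print(put_object(volumes, index, "hash-a", b"first"))
print(put_object(volumes, index, "hash-b", b"second"))
```

Note that POSIX record locks are held per process, so the skip-on-contention branch only triggers when a different process holds the volume.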

Example: reading an object (GET)
• Components: Proxy server, Object server, Index server, Volumes
• Get the object location from the index server
• Open the volume
• Read the object at the given offset

Index server
• Stores data in a key/value store: LevelDB
• Communicates over gRPC
• Key: hash + filename
• Value: volume index + offset
• Keys are sorted on disk for efficient seeks
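A possible encoding of such an entry is sketched below. The key follows the slide (hash immediately followed by the file name); the binary value format, a 4-byte volume index plus an 8-byte offset, is a hypothetical choice made for this example and is not taken from the talk.

```python
import struct

# Hypothetical value encoding: 4-byte volume index + 8-byte offset, big-endian.
VALUE = struct.Struct(">IQ")


def make_key(obj_hash, filename):
    """Key = object hash immediately followed by the file name."""
    return (obj_hash + filename).encode("ascii")


def make_value(volume_index, offset):
    """Pack the object location into a fixed-size value."""
    return VALUE.pack(volume_index, offset)


def parse_value(raw):
    """Return (volume_index, offset)."""
    return VALUE.unpack(raw)


key = make_key("bc6a624f493bf3042662064285f355c4", "1449519086.42102.data")
value = make_value(3, 1048576)
print(key, parse_value(value))
```

Putting the hash first means that all files of one object, and all objects of one partition, end up next to each other in a sorted store.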

Index server – keys example
• ……
• bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data
• bc6a624f493bf3042662064285f355c41449519086.42102.data
• bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data
• ……

What about directories?
• bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data
• bc6a624f493bf3042662064285f355c41449519086.42102.data
• bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data
Directory view of the same keys:
48234
  9e3
    bc6a46b.../1475194591.74265.data
  5c4
    bc6a624.../1449519086.42102.data
48235
  7d1
    bc6b78b…/1415965115.56792.data
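Since the keys are sorted, a directory listing can be rebuilt with a prefix range scan. The sketch below uses a sorted Python list and `bisect` as a stand-in for the key/value store; `list_partition` is a hypothetical helper, and a partition power of 16 is assumed so that a partition maps to a 4-hex-digit key prefix.

```python
import bisect

HASH_LEN = 32  # length of an MD5 hex digest


def list_partition(sorted_keys, partition):
    """Rebuild {suffix: {hash: [filenames]}} for one partition.

    Assumes a partition power of 16 (partition == first 4 hex digits).
    """
    prefix = "%04x" % partition
    tree = {}
    start = bisect.bisect_left(sorted_keys, prefix)
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so the partition ends here
        obj_hash, filename = key[:HASH_LEN], key[HASH_LEN:]
        suffix = obj_hash[-3:]
        tree.setdefault(suffix, {}).setdefault(obj_hash, []).append(filename)
    return tree


keys = sorted([
    "bc6a46b909cf7a8e9529fac36f0669e31475194591.74265.data",
    "bc6a624f493bf3042662064285f355c41449519086.42102.data",
    "bc6b78b325b81b28fcfcdaef49dc87d11415965115.56792.data",
])
print(list_partition(keys, 48234))
```

No directory entries need to be stored: partitions, suffixes and object directories are all derived from the keys.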

Deletion - Hole punching
Image: https://en.wikipedia.org/wiki/Sparse_file#/media/File:Sparse_file_(en).svg

Deletion
• Hole-punching with fallocate()
• Reclaim space without changing the file size!
• Diagram: the space of the deleted object inside the volume is reclaimed by the filesystem
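Python's standard library does not expose hole punching directly, so the sketch below calls the Linux `fallocate(2)` system call through `ctypes`. It assumes a 64-bit Linux system and a filesystem that supports `FALLOC_FL_PUNCH_HOLE` (XFS does); the helper name `punch_hole` is chosen for this example.

```python
import ctypes
import os

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

# Load the C library symbols already linked into the Python process.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_longlong, ctypes.c_longlong]
_libc.fallocate.restype = ctypes.c_int


def punch_hole(fd, offset, length):
    """Deallocate [offset, offset + length) without changing the file size."""
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if _libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

After `punch_hole(fd, offset, length)` the range reads back as zeros and its blocks are returned to the filesystem, while the offsets of all other objects in the volume stay valid.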

Implementation overview
• Swift code, patched: diskfile.py
• vfile.py module
• gRPC
• Index server, with LevelDB as the backing key-value store

vfile.py
• Provides a file-like interface
• f = vfile.open("/path/to/file")
• f.read()
• vfile.listdir("/srv/node/<disk>/<partition>/")

Managing fragmentation
• Dedicated volumes for short-lived files
• Separate volumes for ".data" files and ".ts" files

Write performance
• We cannot afford two synchronous writes
• The large file write is synchronous (fdatasync)
• The large file is preallocated
• K/V writes are asynchronous
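The split between one synchronous write and one asynchronous write could look like the sketch below. It is an assumption-laden illustration: a dict plus a background thread stands in for the key/value store, the function names are hypothetical, and the write offset is tracked by the caller because a preallocated file no longer ends where the data ends.

```python
import os
import queue
import tempfile
import threading

index = {}               # stand-in for the key/value store
pending = queue.Queue()  # index updates waiting to be applied


def index_worker():
    """Apply queued index updates in the background."""
    while True:
        key, location = pending.get()
        index[key] = location
        pending.task_done()


threading.Thread(target=index_worker, daemon=True).start()


def create_volume(path, size):
    """Create a volume and preallocate `size` bytes for it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    os.posix_fallocate(fd, 0, size)
    return fd


def write_object(fd, write_offset, key, data):
    """Write to the volume synchronously, queue the index update."""
    os.pwrite(fd, data, write_offset)
    os.fdatasync(fd)                   # the only synchronous write
    pending.put((key, write_offset))   # applied later by the worker
    return write_offset + len(data)    # next free offset in the volume


volume_fd = create_volume(os.path.join(tempfile.mkdtemp(), "volume"), 1 << 20)
next_offset = write_object(volume_fd, 0, "hash-a", b"hello")
pending.join()  # only needed here so the example can print the result
print(index, next_offset)
```

If the process dies before a queued update is applied, the data is already durable in the volume, which is what makes the recovery scan on the next slide possible.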

Recovery
• Scan the volumes backwards
• Add missing information to the key-value store
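One way to make a backward scan possible is to end every record with its own length. The sketch below assumes such a layout; the header and trailer formats are hypothetical, as is the stopping rule (index updates are assumed to be applied in write order, so the scan can stop at the first record that is already registered).

```python
import os
import struct
import tempfile

HEADER = struct.Struct(">HI")   # hypothetical: key length, data length
TRAILER = struct.Struct(">I")   # hypothetical: total record length


def append_record(volume, key, data):
    """Append one record and return the offset of its header."""
    offset = volume.seek(0, os.SEEK_END)
    record = HEADER.pack(len(key), len(data)) + key + data
    volume.write(record + TRAILER.pack(len(record) + TRAILER.size))
    return offset


def recover(volume, index):
    """Walk the volume from the end, adding records missing from `index`."""
    end = volume.seek(0, os.SEEK_END)
    recovered = 0
    while end > 0:
        volume.seek(end - TRAILER.size)
        (record_len,) = TRAILER.unpack(volume.read(TRAILER.size))
        start = end - record_len
        volume.seek(start)
        key_len, _data_len = HEADER.unpack(volume.read(HEADER.size))
        key = volume.read(key_len)
        if key in index:
            break  # assumed: everything older was already registered
        index[key] = start
        recovered += 1
        end = start
    return recovered


volume = open(os.path.join(tempfile.mkdtemp(), "volume"), "w+b")
offsets = [append_record(volume, k, b"payload") for k in (b"a", b"b", b"c")]
index = {b"a": offsets[0]}          # "b" and "c" were lost in the crash
print(recover(volume, index), index)
```

Scanning from the end keeps recovery cheap: only the tail written since the last applied index update has to be read.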

How does it perform?
• K/V size per object: 42 bytes
• Latency: slightly worse when empty, much better when full
• REPLICATE: served from memory
• Saved space
• Room for improvement

Benchmarks
• PUT, single thread
– XFS: 17/s
– Volumes: 40/s
• PUT, 20 threads
– XFS: 4.7s (99%)
– Volumes: 615ms (99%)
• GET
– XFS: 39/s
– Volumes: 93/s

What’s next
• Upstream
• Store short-lived objects in dedicated volumes
• Replication of volumes
• Choose replica/erasure-coding on the fly

Credits
• Haystack (Facebook project)
• OpenStack Swift community

Metadata storage
• (extra slide if time)
• Previously stored as extended attributes
• Now serialized with protobuf and stored in the volume
