(Title background is "View of the Valhalla near Regensburg" from the Hermitage Museum.)
1. Valhalla at Pantheon
A Distributed File System Built
on Cassandra, Twisted Python, and FUSE
2. Pantheon's Requirements
● Density
○ Over 50K volumes in a single cluster
○ Over 1000 clients on a single application server
● Storage volume
○ Over 10TB in a single cluster
○ De-duplication of redundant data
● Throughput
○ Peaks during the U.S. business day and during site
imports and backups
● Performance
○ Back-end for Drupal web applications; access
has to be fast enough to not burden a web request
○ The applications won't be adapted from running on
local disk to running on Valhalla
3. Why not off-the-shelf?
● NFS
○ UID mapping requires trusted clients and networks
○ Standard Kerberos implementations have no HA
○ No cloud HA for client/server communication
● GlusterFS
○ Cannot scale volume density (though HekaFS can)
○ Cannot de-duplicate data
● Ceph
○ Security model relies on trusted clients
● MooseFS
○ Only primitive security
4. Valhalla's Design Manifesto
● Drupal applications read and write whole
files between 10KB and 10MB
○ And most reads hit the edge proxy cache
● Drupal tracks files in its database and has
little need for fstat() or directory listings
● POSIX compliance for locking and
permissions is unimportant
○ But volume-level access control is critical
● Volumes may contain up to 1M files
● Availability and performance trump
consistency
6. Valhalla 1.0 Retrospective
● What worked
○ Efficient volume cloning
● What didn't
○ Slow computation of directory content when a
directory is small but contains a large subdirectory
■ Fix: Depth prefix for entries
○ Slow computation of file size
■ Fix: Denormalize metadata into directory entries
○ Problems replicating large files
■ Fix: Split files into chunks
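The depth-prefix fix can be sketched in a few lines. The idea: key every directory entry by its depth first, then its path, so all entries of one directory sort together and a listing is a narrow range scan — a small directory never pays for a huge subdirectory beneath it. This is an illustrative toy, not Valhalla's actual Cassandra schema.

```python
from bisect import bisect_left, bisect_right

def entry_key(path):
    """Key an entry by (depth, path) so siblings sort contiguously."""
    return (path.count("/"), path)

def list_directory(sorted_keys, dirpath):
    """Range-scan only keys at depth(dirpath)+1 under dirpath."""
    prefix = dirpath.rstrip("/") + "/"
    depth = prefix.count("/")
    lo = bisect_left(sorted_keys, (depth, prefix))
    hi = bisect_right(sorted_keys, (depth, prefix + "\xff"))
    return [p for _, p in sorted_keys[lo:hi]]

# A small top-level directory next to a large subdirectory:
paths = ["/readme.txt", "/big", "/big/file0", "/big/file1", "/big/file2"]
keys = sorted(entry_key(p) for p in paths)
top = list_directory(keys, "/")   # touches only the two depth-1 entries
```

Listing "/" scans two keys regardless of how many files live under "/big".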
8. Valhalla 2.0 Retrospective
● What worked
○ Version 1.0 issues fixed
● Problems to solve
○ Directory listings iterate over many columns
■ Fix: Cache complete PROPFIND responses
○ Single-threaded client bottlenecks
■ Fix: "Fast path" with direct HTTP from PHP and
proxied by Nginx
○ File content compaction eats up too much disk
■ Fix: "Offloading" cold and large content to S3
using iterative scripts and real-time decisions
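A real-time offload decision reduces to a small predicate on each content item. The thresholds below are hypothetical — the slides do not state Valhalla's actual policy — but the shape is: offload when content is large or cold, so Cassandra compaction has less data to rewrite.

```python
# Hypothetical thresholds; the real policy is not given in the slides.
OFFLOAD_SIZE = 1 * 1024 * 1024       # large files go to S3 immediately
OFFLOAD_COLD_AGE = 30 * 24 * 3600    # content unread for 30 days is cold

def should_offload(size_bytes, seconds_since_last_read):
    """Offload large or cold content from Cassandra to S3."""
    return (size_bytes >= OFFLOAD_SIZE
            or seconds_since_last_read >= OFFLOAD_COLD_AGE)
```

The same predicate serves both the iterative backfill scripts and the real-time path.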
10. Valhalla 3.0 Retrospective
● What worked
○ Version 2.0 issues fixed
● Problems to solve
○ Changes invalidate cached PROPFINDs, forcing
clients to re-issue a full PROPFIND
■ Fix: Extend schema and API to support volume
and directory event propagation
○ Single-threaded client still bottlenecks
■ Fix: New, multithreaded client
○ Client uses a write-invalidate cache
■ Fix: Move to a write-through/write-back model
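The difference between the two cache models can be sketched with an in-memory stand-in for the WebDAV server. Under write-invalidate, a write drops the local entry and the next read is a round-trip; write-through keeps the fresh copy locally while also propagating it upstream. Class and field names are illustrative.

```python
class WriteThroughCache:
    """Writes update both the local copy and the server, so reads of
    recently written files never hit the network. A write-invalidate
    cache would drop the local entry on write instead."""

    def __init__(self, server):
        self.server = server      # dict standing in for the WebDAV server
        self.local = {}
        self.server_reads = 0

    def write(self, path, data):
        self.local[path] = data   # keep the fresh copy...
        self.server[path] = data  # ...and push it upstream

    def read(self, path):
        if path in self.local:
            return self.local[path]
        self.server_reads += 1    # cache miss: one server round-trip
        data = self.server[path]
        self.local[path] = data
        return data
```

Reading back a file you just wrote costs zero server round-trips.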
11. Meanwhile, in backups
● Stopped using davfs2 file mounts
● New backup preparation algorithm
a. Backup builder downloads volume manifest
b. Iterates through each file and goes directly from S3
to the tarball
c. Any files not yet on S3 get pushed there by
requesting an "offload"
● Lower client overhead
● Lower server overhead
● Longer backup preparation time
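The three-step backup algorithm above maps onto a short routine. `fetch_from_s3` and `request_offload` are hypothetical callables — the slides do not name the real interfaces — and a dict stands in for S3 here.

```python
import io
import tarfile

def build_backup(manifest, fetch_from_s3, request_offload):
    """Stream each file in the volume manifest from S3 straight into
    a gzipped tarball, offloading anything not yet on S3 first."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path, on_s3 in manifest:
            if not on_s3:
                request_offload(path)          # push it to S3 first
            data = fetch_from_s3(path)
            info = tarfile.TarInfo(name=path.lstrip("/"))
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# Demo with an in-memory stand-in for S3:
store = {"/a.txt": b"aaa"}
manifest = [("/a.txt", True), ("/b.txt", False)]
blob = build_backup(manifest,
                    fetch_from_s3=store.__getitem__,
                    request_offload=lambda p: store.setdefault(p, b"bbb"))
```

No file content ever passes through a mounted file system, which is where the client- and server-overhead savings come from.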
13. Valhalla 4.0 Retrospective
● What worked
○ Version 3.0 issues fixed
● Problems to solve
○ Cloning volumes breaks the event stream
■ Fix: Invalidate events from before the volume
clone request
○ Clients receiving earlier copies of their own events
■ Fix: Only send clients events published by other
clients
○ Clients write a file and then have to re-download it
because of ETag limitations
■ Fix: Extend PUT to send ETag on response
○ Iteration through file content items times out
■ Fix: Iterate through local sstable keys
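The ETag-on-PUT fix lets the client trust its own copy of a file it just wrote. A sketch, with hypothetical `server_put`/`server_get` callables standing in for the WebDAV layer: the extended PUT returns the new ETag, which the client caches alongside the data, so no follow-up GET is needed.

```python
def put_and_cache(cache, server_put, path, data):
    """With the extended PUT, the response carries the new ETag, so
    the local copy is known-current without a re-download."""
    etag = server_put(path, data)
    cache[path] = (etag, data)

def read(cache, server_get, path):
    """Serve from cache when possible; otherwise fetch and remember."""
    if path in cache:
        return cache[path][1]     # no round-trip needed
    etag, data = server_get(path)
    cache[path] = (etag, data)
    return data
```

Without the extension, the client must GET after every PUT just to learn the ETag it needs for cache validation.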
15. Implementing the Client Side
● Ditched davfs2
○ Single-threaded with only experimental patches to
multi-thread
○ Crufty code base designed to abstract FUSE versus
Coda
● Based code off of fusedav
○ Already multithreaded
○ Uses proven Neon WebDAV client
● Gutted cache
○ Needed fine-grained update capability for
write-through and write-back
○ Replaced with LevelDB
● Added in high-level FUSE operations
○ Atomic open+truncate, atomic create+open, etc.
16. Caching model
● LevelDB
○ Embeddable with low overhead
○ Iteration without allocation management
○ Data model identical to single Cassandra row
○ Storage model similar to Cassandra sstables
○ Similar atomicity to row changes in Cassandra 1.1+
● Mirrored volume row locally
○ Including prefixes and metadata
○ May move to Merkle-based replication later
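The "data model identical to a single Cassandra row" point can be illustrated with an in-memory stand-in for LevelDB's sorted key space (names are illustrative, not the client's real schema): column names become prefixed keys, and reading one kind of metadata is an ordered prefix scan, just as LevelDB iterates its sstables.

```python
from bisect import bisect_left, insort

class PrefixStore:
    """In-memory stand-in for LevelDB: a sorted key space holding the
    (prefix + path -> value) layout of one Cassandra row's columns."""

    def __init__(self):
        self.keys = []
        self.values = {}

    def put(self, key, value):
        if key not in self.values:
            insort(self.keys, key)    # keep keys in sorted order
        self.values[key] = value

    def scan(self, prefix):
        """Iterate keys sharing a prefix, in order, LevelDB-style."""
        i = bisect_left(self.keys, prefix)
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            yield self.keys[i], self.values[self.keys[i]]
            i += 1

store = PrefixStore()
store.put(b"meta:/a.txt", b"size=3")
store.put(b"data:/a.txt", b"abc")
store.put(b"meta:/b.txt", b"size=1")
```

Because the local layout mirrors the server-side row, a fine-grained update on one key never disturbs its neighbors — the property the gutted davfs2-style cache lacked.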
19. What's Next at Pantheon
● Move more toward a pNFS model
○ No file content storage in Cassandra (all in S3)
○ Peer-to-peer or other non-Cassandra file content
coordination between clients
● Peer-to-peer cache advisories between
clients
○ Less chatty server communication to poll events
○ Smaller window of incoherence (3s to <1s)
● Dropping the "fast path"
○ Client is already multithreaded
○ Client cache is smarter than direct Valhalla access
○ Minimizes incompatibility with Drupal
20. What's Next for the Community
● Finalize GPL-licensed FuseDAV client
○ Already public on GitHub
○ Public test suite with bundled server
○ Coordinate with existing FuseDAV users to make the
Pantheon version the official successor
● Publish WebDAV extensions and seek
standards acceptance
○ Progressive PROPFIND
○ ETag on PUT
21. David Strauss
● My groups
○ Drupal Association
○ Pantheon Systems
○ systemd/udev
● Get in touch
○ david@davidstrauss.net
○ @davidstrauss
○ facebook.com/warpforge
● Learn more about Pantheon
○ Developer Open House
○ Presented by Kyle Mathews and Josh Koenig
○ Thursday, February 14th, 12PM PST
○ Sign up: http://tinyurl.com/a3ofpc2