Valhalla at Pantheon



Learn more about Pantheon at the Developer Open House
Presented by Kyle Mathews and Josh Koenig
Thursday, February 14th, 12PM PST
Sign up:

(Title background is "View of the Valhalla near Regensburg" from the Hermitage Museum.)


  1. Valhalla at Pantheon
     A Distributed File System Built on Cassandra, Twisted Python, and FUSE
  2. Pantheon's Requirements
     ● Density
       ○ Over 50K volumes in a single cluster
       ○ Over 1000 clients on a single application server
     ● Storage volume
       ○ Over 10TB in a single cluster
       ○ De-duplication of redundant data
     ● Throughput
       ○ Peaks during the U.S. business day and during site imports and backups
     ● Performance
       ○ Back-end for Drupal web applications; access has to be fast enough to not burden a web request
       ○ The applications won't be adapted from running on local disk to running on Valhalla
  3. Why not off-the-shelf?
     ● NFS
       ○ UID mapping requires trusted clients and networks
       ○ Standard Kerberos implementations have no HA
       ○ No cloud HA for client/server communication
     ● GlusterFS
       ○ Cannot scale volume density (though HekaFS can)
       ○ Cannot de-duplicate data
     ● Ceph
       ○ Security model relies on trusted clients
     ● MooseFS
       ○ Only primitive security
  4. Valhalla's Design Manifesto
     ● Drupal applications read and write whole files between 10KB and 10MB
       ○ And most reads hit the edge proxy cache
     ● Drupal tracks files in its database and has little need for fstat() or directory listings
     ● POSIX compliance for locking and permissions is unimportant
       ○ But volume-level access control is critical
     ● Volumes may contain up to 1MM files
     ● Availability and performance trump consistency
  5. Valhalla 1.0
     (Schema diagram: a volumes column family maps each volume's paths, such as vol1's /d1/ and /d1/f1.txt, to content hashes like ade12... and c12bea...; a content_by_file column family maps each hash to its binary content. Identical files in different volumes share one hash.)
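The two column families above can be sketched as plain Python dicts (a simplification of the Cassandra layout; names and values are illustrative, not Pantheon's actual code):

```python
import hashlib

# content_by_file: content hash -> binary data (one copy per unique blob)
content_by_file = {}

# volumes: volume name -> {path: content hash}
volumes = {"vol1": {}, "vol2": {}}

def put_file(volume, path, data):
    """Store a file; identical content across volumes shares one blob."""
    digest = hashlib.sha1(data).hexdigest()
    content_by_file[digest] = data          # idempotent write: de-duplicates
    volumes[volume][path] = digest

def get_file(volume, path):
    return content_by_file[volumes[volume][path]]

# Two volumes storing the same bytes reference a single stored blob.
put_file("vol1", "/d1/f1.txt", b"hello")
put_file("vol2", "/dir1/f2.txt", b"hello")
```

This is also what makes volume cloning cheap: copying a volume copies only the path-to-hash mapping, never the content itself.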
  6. Valhalla 1.0 Retrospective
     ● What worked
       ○ Efficient volume cloning
     ● What didn't
       ○ Slow computation of directory content when a directory is small but contains a large subdirectory
         ■ Fix: Depth prefix for entries
       ○ Slow computation of file size
         ■ Fix: Denormalize metadata into directory entries
       ○ Problems replicating large files
         ■ Fix: Split files into chunks
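The depth-prefix fix can be illustrated with a minimal sketch. The key scheme here (depth = number of path components) is an assumption for illustration; the deck's diagram may number depths differently, but the principle is the same: entries of one directory sort contiguously, so listing a small directory never scans a large subdirectory's entries.

```python
import bisect

def key(path):
    # Hypothetical scheme: "<depth>:<path>", depth = path component count.
    depth = len([p for p in path.strip("/").split("/") if p])
    return "%d:%s" % (depth, path)

# Sorted keys stand in for Cassandra's ordered columns within a row.
keys = sorted([
    key("/d1/"), key("/d1/f1.txt"), key("/d1/d3/"),
    key("/d1/d3/f2.txt"), key("/d1/d3/f3.txt"),
])

def list_dir(path):
    # Children of `path` all share the prefix "<depth+1>:<path>", so a
    # single range scan finds them without touching deeper entries.
    depth = len([p for p in path.strip("/").split("/") if p]) + 1
    prefix = "%d:%s" % (depth, path)
    start = bisect.bisect_left(keys, prefix)
    out = []
    for k in keys[start:]:
        if not k.startswith(prefix):
            break
        out.append(k.split(":", 1)[1])
    return out
```

Listing /d1/ scans exactly two keys regardless of how many files /d1/d3/ holds.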
  7. Valhalla 2.0
     (Schema diagram: directory entries now carry a depth prefix, e.g. 1:/d1/ and 2:/d1/d3/f2.txt, plus denormalized metadata such as {"size": 1243, "hash": "ade12..."}; large file content in content_by_file is split into numbered chunks 0, 1, 2.)
  8. Valhalla 2.0 Retrospective
     ● What worked
       ○ Version 1.0 issues fixed
     ● Problems to solve
       ○ Directory listings iterate over many columns
         ■ Fix: Cache complete PROPFIND responses
       ○ Single-threaded client bottlenecks
         ■ Fix: "Fast path" with direct HTTP from PHP and proxied by Nginx
       ○ File content compaction eats up too much disk
         ■ Fix: "Offloading" cold and large content to S3 using iterative scripts and real-time decisions
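The real-time side of the offloading decision can be sketched as a simple policy function. The thresholds here are hypothetical, not from the deck; the idea is just that large or cold content moves to S3 while hot, small blobs stay in Cassandra.

```python
# Assumed thresholds, for illustration only.
LARGE_BYTES = 1 * 1024 * 1024      # offload anything over 1MB
COLD_SECONDS = 7 * 24 * 3600       # offload anything untouched for a week

def should_offload(size_bytes, last_access_epoch, now):
    """Decide whether a content item should be pushed to S3."""
    return (size_bytes >= LARGE_BYTES
            or (now - last_access_epoch) >= COLD_SECONDS)
```

An iterative script would apply the same predicate over existing content; the real-time path applies it as writes arrive.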
  9. Valhalla 3.0
     (Schema diagram: a new listing_cache column family stores cached listings keyed by volume and path, e.g. vol1's /dir1/ and /dir2/, vol3's /d1/d2/; the volumes and content_by_file column families are unchanged.)
  10. Valhalla 3.0 Retrospective
     ● What worked
       ○ Version 2.0 issues fixed
     ● Problems to solve
       ○ Changes invalidate cached PROPFINDs, and then clients do a PROPFIND
         ■ Fix: Extend schema and API to support volume and directory event propagation
       ○ Single-threaded client still bottlenecks
         ■ Fix: New, multithreaded client
       ○ Client uses a write-invalidate cache
         ■ Fix: Move to a write-through/write-back model
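The contrast between write-invalidate and write-through can be sketched as follows (illustrative only; a dict stands in for the WebDAV server):

```python
class WriteThroughCache:
    """On write, update both the backing store and the local copy, so the
    next read is served locally. A write-invalidate cache would instead
    evict the entry on write, forcing a round trip back to the server to
    read back data the client itself just wrote."""

    def __init__(self, server):
        self.server = server   # dict standing in for the WebDAV back end
        self.local = {}

    def write(self, path, data):
        self.server[path] = data   # write through to the server...
        self.local[path] = data    # ...and keep the fresh copy locally

    def read(self, path):
        if path not in self.local:             # miss: fetch and fill
            self.local[path] = self.server[path]
        return self.local[path]
```

Write-back extends this by deferring the server write, which is why the client needs fine-grained update capability in its cache.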
  11. Meanwhile, in backups
     ● Stopped using davfs2 file mounts
     ● New backup preparation algorithm
       a. Backup builder downloads volume manifest
       b. Iterates through each file and goes directly from S3 to the tarball
       c. Any files not yet on S3 get pushed there by requesting an "offload"
     ● Lower client overhead
     ● Lower server overhead
     ● Longer backup preparation time
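The three-step algorithm above can be sketched as follows. The function names and the manifest shape are assumptions for illustration, not Pantheon's actual API; dicts stand in for the S3 bucket and the tar stream.

```python
def build_backup(manifest, s3, request_offload):
    """Assemble a backup directly from S3, offloading anything missing.

    manifest: {path: content_hash} downloaded from the volume (step a)
    s3: dict standing in for the bucket
    request_offload(h): must place hash h into s3 (step c)
    """
    tarball = []   # stand-in for streaming entries into a tar archive
    for path, digest in sorted(manifest.items()):
        if digest not in s3:
            request_offload(digest)        # push cold content to S3 first
        tarball.append((path, s3[digest])) # S3 -> tarball, no mount (step b)
    return tarball
```

Because no file ever passes through a davfs2 mount, both client and server overhead drop, at the cost of the offload round trips lengthening preparation time.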
  12. Valhalla 4.0
     (Schema diagram: a new events column family stores a time-ordered stream keyed by volume and path, e.g. vol1:/dir1/ holds {"path": "/dir2/", "event": "CREATED"} at t=1 and vol3:/d1/d2/ holds CREATED and DESTROYED events for f3.txt at t=5 and t=6; content_by_file, volumes, and listing_cache are unchanged.)
  13. Valhalla 4.0 Retrospective
     ● What worked
       ○ Version 3.0 issues fixed
     ● Problems to solve
       ○ Cloning volumes breaks the event stream
         ■ Fix: Invalidate events from before the volume clone request
       ○ Clients receiving earlier copies of their own events
         ■ Fix: Only send clients events published by other clients
       ○ Clients write a file and then have to re-download it because of ETag limitations
         ■ Fix: Extend PUT to send ETag on response
       ○ Iteration through file content items times out
         ■ Fix: Iterate through local sstable keys
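The own-events fix can be sketched by tagging each event with its publisher and filtering on delivery. The "publisher" field name is hypothetical; the event shape otherwise follows the diagram on the previous slide.

```python
def events_for(client_id, event_stream):
    """Deliver only events published by *other* clients, so a client never
    reacts to a stale copy of a change it already made locally."""
    return [e for e in event_stream if e["publisher"] != client_id]

stream = [
    {"path": "/dir2/f2.txt", "event": "CREATED", "publisher": "client-a"},
    {"path": "/dir2/f3.txt", "event": "CREATED", "publisher": "client-b"},
]
```

Without the filter, client-a's own CREATED event could arrive after a later local change and roll its cache backward.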
  14. Valhalla 4.5
     (Schema diagram: a new volume_metadata column family records per-volume timestamps, e.g. t=3 for vol1 and t=2 for vol3; the volumes and listing_cache column families are rewritten, while content_by_file and events are unchanged.)
  15. Implementing the Client Side
     ● Ditched davfs2
       ○ Single-threaded with only experimental patches to multi-thread
       ○ Crufty code base designed to abstract FUSE versus Coda
     ● Based code off of fusedav
       ○ Already multithreaded
       ○ Uses proven Neon WebDAV client
     ● Gutted cache
       ○ Needed fine-grained update capability for write-through and write-back
       ○ Replaced with LevelDB
     ● Added in high-level FUSE operations
       ○ Atomic open+truncate, atomic create+open, etc.
  16. Caching model
     ● LevelDB
       ○ Embeddable with low overhead
       ○ Iteration without allocation management
       ○ Data model identical to a single Cassandra row
       ○ Storage model similar to Cassandra sstables
       ○ Similar atomicity to row changes in Cassandra 1.1+
     ● Mirrored volume row locally
       ○ Including prefixes and metadata
       ○ May move to Merkle-based replication later
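The mirrored-row idea depends on ordered key iteration: the client's local store holds the same sorted keyspace as the volume's Cassandra row, so a directory listing is one prefix scan. The real client embeds LevelDB from C; this sketch stands in for the ordered keyspace with a sorted Python list, and the key layout is an assumption.

```python
import bisect

class OrderedKV:
    """Stand-in for LevelDB: keys kept sorted, range scans by prefix."""

    def __init__(self):
        self._keys, self._vals = [], {}

    def put(self, k, v):
        if k not in self._vals:
            bisect.insort(self._keys, k)
        self._vals[k] = v

    def scan(self, prefix):
        # Seek to the first key >= prefix, then walk while it matches,
        # mirroring LevelDB's iterator seek-and-next pattern.
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._vals[self._keys[i]]
            i += 1

# Mirror one volume's row: depth-prefixed entries under a common prefix,
# like the columns of a single Cassandra row.
db = OrderedKV()
db.put("vol1:2:/d1/f1.txt", {"size": 1243})
db.put("vol1:2:/d1/d3/", {})
listing = dict(db.scan("vol1:2:/d1/"))
```

Because both stores sort keys the same way, the local mirror and the server row can later be compared subtree by subtree, which is what would make Merkle-based replication a natural next step.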
  17. Benchmarks versus Local and Older Models
  18. Benchmarks versus Local and Older Models
  19. What's Next at Pantheon
     ● Move more toward a pNFS model
       ○ No file content storage in Cassandra (all in S3)
       ○ Peer-to-peer or other non-Cassandra file content coordination between clients
     ● Peer-to-peer cache advisories between clients
       ○ Less chatty server communication to poll events
       ○ Smaller window of incoherence (3s to <1s)
     ● Dropping the "fast path"
       ○ Client is already multithreaded
       ○ Client cache is smarter than direct Valhalla access
       ○ Minimizes incompatibility with Drupal
  20. What's Next for the Community
     ● Finalize GPL-licensed FuseDAV client
       ○ Already public on GitHub
       ○ Public test suite with bundled server
       ○ Coordinate with existing FuseDAV users to make the Pantheon version the official successor
     ● Publish WebDAV extensions and seek standards acceptance
       ○ Progressive PROPFIND
       ○ ETag on PUT
  21. David Strauss
     ● My groups
       ○ Drupal Association
       ○ Pantheon Systems
       ○ systemd/udev
     ● Get in touch
       ○ @davidstrauss
     ● Learn more about Pantheon
       ○ Developer Open House
       ○ Presented by Kyle Mathews and Josh Koenig
       ○ Thursday, February 14th, 12PM PST
       ○ Sign up: