Learn more about Pantheon at the Developer Open House
Presented by Kyle Mathews and Josh Koenig
Thursday, February 14th, 12PM PST
Sign up: http://tinyurl.com/a3ofpc2
(Title background is "View of the Valhalla near Regensburg" from the Hermitage Museum.)
CTO/co-founder of Pantheon; systemd co-maintainer; Drupal Security Team member; founding member of 1st Fedora Server WG
Pantheon's Requirements
● Density
  ○ Over 50K volumes in a single cluster
  ○ Over 1000 clients on a single application server
● Storage volume
  ○ Over 10TB in a single cluster
  ○ De-duplication of redundant data
● Throughput
  ○ Peaks during the U.S. business day and during site imports and backups
● Performance
  ○ Back-end for Drupal web applications; access has to be fast enough to not burden a web request
  ○ The applications won't be adapted from running on local disk to running on Valhalla
Why not off-the-shelf?
● NFS
  ○ UID mapping requires trusted clients and networks
  ○ Standard Kerberos implementations have no HA
  ○ No cloud HA for client/server communication
● GlusterFS
  ○ Cannot scale volume density (though HekaFS can)
  ○ Cannot de-duplicate data
● Ceph
  ○ Security model relies on trusted clients
● MooseFS
  ○ Only primitive security
Valhalla's Design Manifesto
● Drupal applications read and write whole files between 10KB and 10MB
  ○ And most reads hit the edge proxy cache
● Drupal tracks files in its database and has little need for fstat() or directory listings
● POSIX compliance for locking and permissions is unimportant
  ○ But volume-level access control is critical
● Volumes may contain up to 1MM files
● Availability and performance trump consistency
Valhalla 1.0 Retrospective
● What worked
  ○ Efficient volume cloning
● What didn't
  ○ Slow computation of directory content when a directory is small but contains a large subdirectory
    ■ Fix: Depth prefix for entries
  ○ Slow computation of file size
    ■ Fix: Denormalize metadata into directory entries
  ○ Problems replicating large files
    ■ Fix: Split files into chunks
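The "depth prefix" fix can be illustrated with a sorted key space. This is a minimal sketch, not Valhalla's actual schema: the slides do not give the key format, so the `<depth>:<path>` encoding and the names `entry_key` and `list_directory` are assumptions made for illustration. With the depth in front of each key, the direct children of a directory sit next to each other in sort order, so a listing becomes one contiguous range scan that never walks through the entries of a large subdirectory.

```python
import bisect

def entry_key(path):
    """Hypothetical key format: zero-padded depth, a colon, then the full path."""
    clean = path.strip("/")
    depth = clean.count("/") + 1
    return f"{depth:04d}:/{clean}"

def list_directory(sorted_keys, directory):
    """Return the direct children of `directory` using one contiguous range scan."""
    clean = directory.strip("/")
    if clean:
        child_depth = clean.count("/") + 2
        prefix = f"{child_depth:04d}:/{clean}/"
    else:
        # Children of the volume root live at depth 1.
        prefix = "0001:/"
    start = bisect.bisect_left(sorted_keys, prefix)
    children = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # left the contiguous range for this directory
        children.append(key.split(":", 1)[1])
    return children
```

In this sketch, listing a small directory that holds one huge subdirectory touches only the small directory's own entries, because the subdirectory's files are stored under a deeper depth prefix.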
Valhalla 2.0 Retrospective
● What worked
  ○ Version 1.0 issues fixed
● Problems to solve
  ○ Directory listings iterate over many columns
    ■ Fix: Cache complete PROPFIND responses
  ○ Single-threaded client bottlenecks
    ■ Fix: "Fast path" with direct HTTP from PHP and proxied by Nginx
  ○ File content compaction eats up too much disk
    ■ Fix: "Offloading" cold and large content to S3 using iterative scripts and real-time decisions
Valhalla 3.0 Retrospective
● What worked
  ○ Version 2.0 issues fixed
● Problems to solve
  ○ Changes invalidate cached PROPFINDs, and then clients do a PROPFIND
    ■ Fix: Extend schema and API to support volume and directory event propagation
  ○ Single-threaded client still bottlenecks
    ■ Fix: New, multithreaded client
  ○ Client uses a write-invalidate cache
    ■ Fix: Move to a write-through/write-back model
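The difference between a write-invalidate cache and a write-through cache can be shown with a toy model. This is a sketch under stated assumptions, not the client's code: `Server`, `WriteInvalidateCache`, and `WriteThroughCache` are hypothetical names, and plain dictionaries stand in for the remote store and the local cache. Write-back, which would additionally defer the upload, is not modeled here.

```python
class Server:
    """Stand-in for the remote store; counts how often content is downloaded."""
    def __init__(self):
        self.files = {}
        self.downloads = 0

    def put(self, path, data):
        self.files[path] = data

    def get(self, path):
        self.downloads += 1
        return self.files[path]


class WriteInvalidateCache:
    """On write, upload and drop the local copy; the next read re-downloads."""
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def write(self, path, data):
        self.server.put(path, data)
        self.cache.pop(path, None)

    def read(self, path):
        if path not in self.cache:
            self.cache[path] = self.server.get(path)
        return self.cache[path]


class WriteThroughCache(WriteInvalidateCache):
    """On write, upload and keep the local copy; the next read is served locally."""
    def write(self, path, data):
        self.server.put(path, data)
        self.cache[path] = data
```

With the write-invalidate variant, a write followed by a read of the same file costs one download; with the write-through variant it costs none.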
Meanwhile, in backups
● Stopped using davfs2 file mounts
● New backup preparation algorithm
  a. Backup builder downloads volume manifest
  b. Iterates through each file and goes directly from S3 to the tarball
  c. Any files not yet on S3 get pushed there by requesting an "offload"
● Lower client overhead
● Lower server overhead
● Longer backup preparation time
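Steps a through c of the backup preparation algorithm can be sketched as a single loop. This is an illustration only: `build_backup` and its parameters are hypothetical, a dictionary stands in for S3, and `request_offload` is an assumed callback that makes a missing file appear in that store. The real manifest format and offload API are not described in the slides.

```python
import io
import tarfile

def build_backup(manifest, s3, request_offload):
    """Build a gzipped tarball from a volume manifest.

    manifest: iterable of file paths (step a, already downloaded)
    s3: dict-like mapping of path -> bytes (stand-in for the object store)
    request_offload: callable(path) that pushes a file to s3 (step c)
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for path in manifest:
            if path not in s3:
                # Step c: file is not on S3 yet, so ask for an offload first.
                request_offload(path)
            # Step b: stream the content from S3 straight into the tarball.
            data = s3[path]
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()
```

No file system mount is involved at any point, which matches the lower client and server overhead noted above; the extra offload round trips are one plausible source of the longer preparation time.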
Valhalla 4.0 Retrospective
● What worked
  ○ Version 3.0 issues fixed
● Problems to solve
  ○ Cloning volumes breaks the event stream
    ■ Fix: Invalidate events from before the volume clone request
  ○ Clients receiving earlier copies of their own events
    ■ Fix: Only send clients events published by other clients
  ○ Clients write a file and then have to re-download it because of ETag limitations
    ■ Fix: Extend PUT to send ETag on response
  ○ Iteration through file content items times out
    ■ Fix: Iterate through local sstable keys
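The two event-stream fixes amount to a filter over the events a client polls. The sketch below assumes events carry a monotonically increasing sequence number and a publisher ID; the `Event` fields, the `events_for_client` function, and the `clone_sequence` parameter are hypothetical, since the slides do not describe the event schema.

```python
from dataclasses import dataclass

@dataclass
class Event:
    sequence: int    # assumed: monotonically increasing per volume
    publisher: str   # assumed: ID of the client that caused the change
    path: str

def events_for_client(events, client_id, clone_sequence=0):
    """Select the events a client should apply to its cache.

    Drops events at or before the volume clone request (they describe the
    source volume's history) and events the client published itself.
    """
    return [
        event
        for event in events
        if event.sequence > clone_sequence and event.publisher != client_id
    ]
```

For example, if a volume was cloned at sequence 2, a client polling afterwards would only see later events, and only those caused by other clients.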
Implementing the Client Side
● Ditched davfs2
  ○ Single-threaded with only experimental patches to multi-thread
  ○ Crufty code base designed to abstract FUSE versus Coda
● Based code off of fusedav
  ○ Already multithreaded
  ○ Uses proven Neon WebDAV client
● Gutted cache
  ○ Needed fine-grained update capability for write-through and write-back
  ○ Replaced with LevelDB
● Added in high-level FUSE operations
  ○ Atomic open+truncate, atomic create+open, etc.
Caching model
● LevelDB
  ○ Embeddable with low overhead
  ○ Iteration without allocation management
  ○ Data model identical to single Cassandra row
  ○ Storage model similar to Cassandra sstables
  ○ Similar atomicity to row changes in Cassandra 1.1+
● Mirrored volume row locally
  ○ Including prefixes and metadata
  ○ May move to Merkle-based replication later
What's Next at Pantheon
● Move more toward a pNFS model
  ○ No file content storage in Cassandra (all in S3)
  ○ Peer-to-peer or other non-Cassandra file content coordination between clients
● Peer-to-peer cache advisories between clients
  ○ Less chatty server communication to poll events
  ○ Smaller window of incoherence (3s to <1s)
● Dropping the "fast path"
  ○ Client is already multithreaded
  ○ Client cache is smarter than direct Valhalla access
  ○ Minimizes incompatibility with Drupal
What's Next for the Community
● Finalize GPL-licensed FuseDAV client
  ○ Already public on GitHub
  ○ Public test suite with bundled server
  ○ Coordinate with existing FuseDAV users to make the Pantheon version the official successor
● Publish WebDAV extensions and seek standards acceptance
  ○ Progressive PROPFIND
  ○ ETag on PUT
David Strauss
● My groups
  ○ Drupal Association
  ○ Pantheon Systems
  ○ systemd/udev
● Get in touch
  ○ david@davidstrauss.net
  ○ @davidstrauss
  ○ facebook.com/warpforge
● Learn more about Pantheon
  ○ Developer Open House
  ○ Presented by Kyle Mathews and Josh Koenig
  ○ Thursday, February 14th, 12PM PST
  ○ Sign up: http://tinyurl.com/a3ofpc2