Valhalla at Pantheon
     A Distributed File System Built
on Cassandra, Twisted Python, and FUSE
Pantheon's Requirements
● Density
  ○ Over 50K volumes in a single cluster
  ○ Over 1000 clients on a single application server
● Storage volume
  ○ Over 10TB in a single cluster
  ○ De-duplication of redundant data
● Throughput
  ○ Peaks during the U.S. business day and during site
    imports and backups
● Performance
  ○ Back-end for Drupal web applications; access
    has to be fast enough to not burden a web request
  ○ The applications won't be adapted from running on
    local disk to running on Valhalla
Why not off-the-shelf?
● NFS
  ○ UID mapping requires trusted clients and networks
  ○ Standard Kerberos implementations have no HA
  ○ No cloud HA for client/server communication
● GlusterFS
  ○ Cannot scale volume density (though HekaFS can)
  ○ Cannot de-duplicate data
● Ceph
  ○ Security model relies on trusted clients
● MooseFS
  ○ Only primitive security
Valhalla's Design Manifesto
● Drupal applications read and write whole
  files between 10KB and 10MB
   ○ And most reads hit the edge proxy cache
● Drupal tracks files in its database and has
  little need for fstat() or directory listings
● POSIX compliance for locking and
  permissions is unimportant
   ○ But volume-level access control is critical
● Volumes may contain up to 1MM files
● Availability and performance trump
  consistency
Valhalla 1.0
● volumes
  ○ vol1
    ■ /d1/
    ■ /d1/f1.txt → ade12...
    ■ /d1/d3/
    ■ /d1/d3/f2.txt → c12bea...
    ■ ...
  ○ vol2
    ■ /dir1/
    ■ /dir1/file.txt → ade12...
    ■ /dir1/f2.txt → c12bea...
    ■ ...
  ○ vol3
    ■ /dir3/
    ■ /dir3/f3.txt → 13a8cd...
    ■ /dir3/f2.txt → c12bea...
    ■ ...
  ○ ...
● content_by_file
  ○ ade12...
    ■ content → binary
  ○ c12bea...
    ■ content → binary
  ○ 13a8cd...
    ■ content → binary
  ○ ...
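A minimal Python sketch of the Valhalla 1.0 layout, with in-memory dicts standing in for the two Cassandra column families. The SHA-1 hash and the helper names (`write_file`, `read_file`, `clone_volume`) are assumptions for illustration; the slides name neither the hash function nor the API. It shows why identical content is stored once across volumes and why cloning a volume only copies path entries.

```python
import hashlib

# Stand-ins for the two column families in the diagram.
volumes = {}          # volume name -> {path: content hash}
content_by_file = {}  # content hash -> file bytes


def write_file(volume, path, data):
    """Store data under its content hash and point the path at it."""
    digest = hashlib.sha1(data).hexdigest()   # hash choice is an assumption
    content_by_file.setdefault(digest, data)  # identical content stored once
    volumes.setdefault(volume, {})[path] = digest
    return digest


def read_file(volume, path):
    """Resolve the path to a hash, then fetch the content by hash."""
    return content_by_file[volumes[volume][path]]


def clone_volume(source, target):
    """Clone by copying path entries only; no file content is copied."""
    volumes[target] = dict(volumes[source])


write_file("vol1", "/d1/f1.txt", b"hello")
write_file("vol2", "/dir1/file.txt", b"hello")  # same bytes, same hash
clone_volume("vol1", "vol1-clone")
```

After these calls, `content_by_file` holds a single entry even though three paths in three volumes refer to it.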
Valhalla 1.0 Retrospective
● What worked
  ○ Efficient volume cloning
● What didn't
  ○ Slow computation of directory content when a
    directory is small but contains a large subdirectory
    ■ Fix: Depth prefix for entries
  ○ Slow computation of file size
    ■ Fix: Denormalize metadata into directory entries
  ○ Problems replicating large files
    ■ Fix: Split files into chunks
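The depth-prefix fix can be sketched in Python as follows. The exact depth convention is not spelled out in the slides, so this sketch assumes the prefix is the number of path components in the entry's parent directory; `entry_key` and `list_children` are hypothetical names. With sorted keys (as Cassandra column names are), the direct children of a directory form one contiguous range, so a large subdirectory is never scanned.

```python
import bisect


def entry_key(path):
    """Prefix a path with the depth of its parent directory (assumed scheme)."""
    parent = path.rstrip("/").rsplit("/", 1)[0]  # "" for top-level entries
    return f"{parent.count('/')}:{path}"


def list_children(sorted_keys, directory):
    """Return the direct children of directory with one range scan."""
    depth = directory.rstrip("/").count("/")
    prefix = f"{depth}:{directory}"
    start = bisect.bisect_left(sorted_keys, prefix)
    children = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # left the contiguous range for this directory
        children.append(key.split(":", 1)[1])
    return children


paths = ["/d1/", "/d1/f1.txt", "/d1/d3/", "/d1/d3/f2.txt", "/d1/d3/f3.txt"]
keys = sorted(entry_key(p) for p in paths)
```

Listing `/d1/` touches only the keys that start with `1:/d1/`; the entries inside `/d1/d3/` carry the prefix `2:` and sort elsewhere.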
Valhalla 2.0
● volumes
  ○ vol1
    ■ 1:/d1/
    ■ 1:/d1/f1.txt → {"size": 1243, "hash": ade12...}
    ■ 1:/d1/d3/
    ■ 2:/d1/d3/f2.txt → {"size": 111, "hash": c12bea...}
    ■ ...
  ○ vol2
    ■ 1:/dir1/
    ■ 1:/dir1/file.txt → {"size": 1243, "hash": ade12...}
    ■ 1:/dir1/f2.txt → {"size": 111, "hash": c12bea...}
    ■ ...
  ○ vol3
    ■ 1:/dir3/
    ■ 1:/dir3/f3.txt → {"size": 5243, "hash": 13a8cd...}
    ■ 1:/dir3/f2.txt → {"size": 111, "hash": c12bea...}
    ■ ...
  ○ ...
● content_by_file
  ○ ade12...
    ■ 0 → binary
    ■ 1 → binary
  ○ c12bea...
    ■ 0 → binary
  ○ 13a8cd...
    ■ 0 → binary
    ■ 1 → binary
    ■ 2 → binary
  ○ ...
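A short Python sketch of the two other 2.0 changes: metadata denormalized into the directory entry and content split into numbered chunks. The chunk size, the SHA-1 hash, and the function names are assumptions for illustration (the slides give none of them), and depth prefixes are omitted to keep the example small.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for demonstration; the real chunk size is not stated

volumes = {}          # volume -> {path: {"size": int, "hash": str}}
content_by_file = {}  # content hash -> {chunk number: bytes}


def write_file(volume, path, data):
    """Store content as numbered chunks and metadata in the path entry."""
    digest = hashlib.sha1(data).hexdigest()  # hash choice is an assumption
    chunks = {
        offset // CHUNK_SIZE: data[offset:offset + CHUNK_SIZE]
        for offset in range(0, len(data), CHUNK_SIZE)
    }
    content_by_file.setdefault(digest, chunks)
    volumes.setdefault(volume, {})[path] = {"size": len(data), "hash": digest}


def file_size(volume, path):
    """Answer a size query from the directory entry alone."""
    return volumes[volume][path]["size"]


def read_file(volume, path):
    """Reassemble the file by joining its chunks in order."""
    chunks = content_by_file[volumes[volume][path]["hash"]]
    return b"".join(chunks[number] for number in sorted(chunks))


write_file("vol3", "/dir3/f3.txt", b"0123456789")
```

`file_size` never touches `content_by_file`, which is the point of the denormalization: size queries no longer require reading file content.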
Valhalla 2.0 Retrospective
● What worked
  ○ Version 1.0 issues fixed
● Problems to solve
  ○ Directory listings iterate over many columns
    ■ Fix: Cache complete PROPFIND responses
  ○ Single-threaded client bottlenecks
    ■ Fix: "Fast path" with direct HTTP from PHP and
        proxied by Nginx
  ○ File content compaction eats up too much disk
    ■ Fix: "Offloading" cold and large content to S3
        using iterative scripts and real-time decisions
Valhalla 3.0
● listing_cache
  ○ vol1
    ■ /dir1/ → binary
    ■ /dir2/ → binary
  ○ vol2
    ■ /dir1/ → binary
  ○ vol3
    ■ /d1/ → binary
    ■ /d1/d2/ → binary
    ■ /d3/ → binary
  ○ ...
● Unchanged
  ○ content_by_file
  ○ volumes
Valhalla 3.0 Retrospective
● What worked
  ○ Version 2.0 issues fixed
● Problems to solve
  ○ Changes invalidate cached PROPFINDs, and then
    clients do a PROPFIND
    ■ Fix: Extend schema and API to support volume
        and directory event propagation
  ○ Single-threaded client still bottlenecks
    ■ Fix: New, multithreaded client
  ○ Client uses a write-invalidate cache
    ■ Fix: Move to a write-through/write-back model
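The difference between a write-invalidate and a write-through client cache can be illustrated with a small Python sketch. All class names are hypothetical, and `Server` is only a stand-in that counts downloads; write-back (deferring the upload) is not shown.

```python
class Server:
    """Hypothetical stand-in for the Valhalla server; counts downloads."""

    def __init__(self):
        self.files = {}
        self.downloads = 0

    def put(self, path, data):
        self.files[path] = data

    def get(self, path):
        self.downloads += 1
        return self.files[path]


class WriteInvalidateCache:
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def write(self, path, data):
        self.server.put(path, data)
        self.cache.pop(path, None)  # drop the entry; next read re-downloads

    def read(self, path):
        if path not in self.cache:
            self.cache[path] = self.server.get(path)
        return self.cache[path]


class WriteThroughCache(WriteInvalidateCache):
    def write(self, path, data):
        self.server.put(path, data)
        self.cache[path] = data  # keep the freshly written copy locally


old_server, new_server = Server(), Server()
old_client = WriteInvalidateCache(old_server)
new_client = WriteThroughCache(new_server)
for client in (old_client, new_client):
    client.write("/d1/f1.txt", b"hello")
    client.read("/d1/f1.txt")
```

With write-invalidate, a client that writes a file and then reads it back pays for a download of data it just had; with write-through the read is served locally.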
Meanwhile, in backups
● Stopped using davfs2 file mounts
● New backup preparation algorithm
  a. Backup builder downloads volume manifest
  b. Iterates through each file and goes directly from S3
     to the tarball
  c. Any files not yet on S3 get pushed there by
     requesting an "offload"
● Lower client overhead
● Lower server overhead
● Longer backup preparation time
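Steps a through c of the backup preparation can be sketched in Python using the standard `tarfile` module. The manifest shape (path to content hash), the dict standing in for the S3 bucket, the `request_offload` callback, and the placeholder hashes are all assumptions for illustration; the slides do not describe the actual interfaces.

```python
import io
import tarfile


def build_backup(manifest, s3, request_offload, out):
    """Write a tarball of every file in manifest to the file object out.

    manifest: {path: content hash}, as downloaded for the volume (step a)
    s3: dict-like {content hash: bytes}, a stand-in for the S3 bucket
    request_offload: callable(hash) asking the server to push content to S3
    """
    with tarfile.open(fileobj=out, mode="w") as tar:
        for path, digest in sorted(manifest.items()):
            if digest not in s3:
                request_offload(digest)  # step c: push missing content to S3
            data = s3[digest]            # step b: read straight from S3
            info = tarfile.TarInfo(name=path.lstrip("/"))
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


# Demonstration with placeholder hashes and an in-memory "bucket".
server_content = {"hash-a": b"hello", "hash-b": b"world"}
s3 = {"hash-a": b"hello"}  # only one file has been offloaded so far
offloaded = []


def request_offload(digest):
    s3[digest] = server_content[digest]
    offloaded.append(digest)


manifest = {"/d1/f1.txt": "hash-a", "/d1/d3/f2.txt": "hash-b"}
archive = io.BytesIO()
build_backup(manifest, s3, request_offload, archive)
```

The builder never mounts the volume, which matches the lower client and server overhead noted above; waiting on offloads for content not yet in S3 is one plausible source of the longer preparation time.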
Valhalla 4.0
● events
  ○ vol1:/dir1/
    ■ t=1 → {"path": "/dir2/", "event": "CREATED"...
    ■ t=2 → {"path": "/dir2/f2.txt", "event": "CREATED"...
  ○ vol1:/dir2/
    ■ t=5 → {"path": "/dir5/", "event": "CREATED"...
    ■ t=6 → {"path": "/dir6/", "event": "CREATED"...
  ○ vol3:/d1/d2/
    ■ t=5 → {"path": "f3.txt", "event": "CREATED"...
    ■ t=6 → {"path": "f3.txt", "event": "DESTROYED"...
  ○ ...
● Unchanged
  ○ content_by_file
  ○ volumes
  ○ listing_cache
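A minimal Python sketch of the events layout: one row per volume and directory, one column per timestamp, with clients polling for everything newer than the last event they saw. The `publish` and `poll` names and the polling interface are assumptions; the slides show only the stored shape.

```python
events = {}  # "volume:directory" -> list of (timestamp, event dict)


def publish(volume, directory, timestamp, path, event):
    """Append an event to the directory's row, kept in timestamp order."""
    row = events.setdefault(f"{volume}:{directory}", [])
    row.append((timestamp, {"path": path, "event": event}))
    row.sort(key=lambda item: item[0])


def poll(volume, directory, since):
    """Return events in the directory's row newer than timestamp since."""
    row = events.get(f"{volume}:{directory}", [])
    return [(t, e) for t, e in row if t > since]


publish("vol3", "/d1/d2/", 5, "f3.txt", "CREATED")
publish("vol3", "/d1/d2/", 6, "f3.txt", "DESTROYED")
latest = poll("vol3", "/d1/d2/", 5)
```

A client that has already applied the event at t=5 receives only the later `DESTROYED` event and can update its cached listing without issuing a new PROPFIND.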
Valhalla 4.0 Retrospective
● What worked
  ○ Version 3.0 issues fixed
● Problems to solve
  ○ Cloning volumes breaks the event stream
     ■ Fix: Invalidate events from before the volume
        clone request
  ○ Clients receiving earlier copies of their own events
     ■ Fix: Only send clients events published by other
        clients
  ○ Clients write a file and then have to re-download it
    because of ETag limitations
     ■ Fix: Extend PUT to send ETag on response
  ○ Iteration through file content items times out
     ■ Fix: Iterate through local sstable keys
Valhalla 4.5
● volume_metadata
  ○ vol1
    ■ rewritten → t=3
  ○ vol2
  ○ vol3
    ■ rewritten → t=2
  ○ ...
● Unchanged
  ○ content_by_file
  ○ volumes
  ○ listing_cache
  ○ events
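One way to read the `rewritten` column together with the 4.0 fixes is sketched below in Python. This is an interpretation, not the documented protocol: it assumes each event records the publishing client, that a poll from before the `rewritten` timestamp is answered with a "resync" signal instead of events, and that `publish`, `mark_rewritten`, and `poll` are the (hypothetical) entry points.

```python
volume_metadata = {}  # volume -> {"rewritten": timestamp of last clone/rewrite}
events = {}           # "volume:directory" -> [(timestamp, client_id, event)]


def publish(volume, directory, timestamp, client_id, event):
    events.setdefault(f"{volume}:{directory}", []).append(
        (timestamp, client_id, event)
    )


def mark_rewritten(volume, timestamp):
    """Record that the volume was cloned over or rewritten at timestamp."""
    volume_metadata.setdefault(volume, {})["rewritten"] = timestamp


def poll(volume, directory, since, client_id):
    """Return ("resync", []) if older events are invalid, else new events.

    Events published by the polling client itself are filtered out.
    """
    rewritten = volume_metadata.get(volume, {}).get("rewritten")
    if rewritten is not None and rewritten > since:
        return "resync", []  # assumed client reaction: rebuild local state
    row = events.get(f"{volume}:{directory}", [])
    return "events", [
        (t, e)
        for t, who, e in sorted(row, key=lambda item: item[0])
        if t > since and who != client_id
    ]


publish("vol1", "/dir1/", 1, "client-a", {"path": "/dir2/", "event": "CREATED"})
publish("vol1", "/dir1/", 2, "client-b", {"path": "/dir2/f2.txt", "event": "CREATED"})
```

Here `client-a` polling from t=0 sees only the event from `client-b`; once the volume is marked rewritten at t=3, any client whose last-seen timestamp is older than 3 is told to resynchronize.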
Implementing the Client Side
● Ditched davfs2
  ○ Single-threaded with only experimental patches to
    multi-thread
  ○ Crufty code base designed to abstract FUSE versus
    Coda
● Based code off of fusedav
  ○ Already multithreaded
  ○ Uses proven Neon WebDAV client
● Gutted cache
  ○ Needed fine-grained update capability for write-
    through and write-back
  ○ Replaced with LevelDB
● Added in high-level FUSE operations
  ○ Atomic open+truncate, atomic create+open, etc.
Caching model
● LevelDB
  ○   Embeddable with low overhead
  ○   Iteration without allocation management
  ○   Data model identical to single Cassandra row
  ○   Storage model similar to Cassandra sstables
  ○   Similar atomicity to row changes in Cassandra 1.1+
● Mirrored volume row locally
  ○ Including prefixes and metadata
  ○ May move to Merkle-based replication later
Benchmarks versus Local and Older Models
What's Next at Pantheon
● Move more toward a pNFS model
  ○ No file content storage in Cassandra (all in S3)
  ○ Peer-to-peer or other non-Cassandra file content
    coordination between clients
● Peer-to-peer cache advisories between
  clients
  ○ Less chatty server communication to poll events
  ○ Smaller window of incoherence (3s to <1s)
● Dropping the "fast path"
  ○ Client is already multithreaded
  ○ Client cache is smarter than direct Valhalla access
  ○ Minimizes incompatibility with Drupal
What's Next for the Community
● Finalize GPL-licensed FuseDAV client
  ○ Already public on GitHub
  ○ Public test suite with bundled server
  ○ Coordinate with existing FuseDAV users to make the
    Pantheon version the official successor
● Publish WebDAV extensions and seek
  standards acceptance
  ○ Progressive PROPFIND
  ○ ETag on PUT
David Strauss
● My groups
  ○ Drupal Association
  ○ Pantheon Systems
  ○ systemd/udev
● Get in touch
  ○ david@davidstrauss.net
  ○ @davidstrauss
  ○ facebook.com/warpforge
● Learn more about Pantheon
  ○   Developer Open House
  ○   Presented by Kyle Mathews and Josh Koenig
  ○   Thursday, February 14th, 12PM PST
  ○   Sign up: http://tinyurl.com/a3ofpc2

More Related Content

Similar to Valhalla at Pantheon

Troubleshooting containerized triple o deployment
Troubleshooting containerized triple o deploymentTroubleshooting containerized triple o deployment
Troubleshooting containerized triple o deploymentSadique Puthen
 
ContainerDayVietnam2016: Django Development with Docker
ContainerDayVietnam2016: Django Development with DockerContainerDayVietnam2016: Django Development with Docker
ContainerDayVietnam2016: Django Development with DockerDocker-Hanoi
 
Rooting Out Root: User namespaces in Docker
Rooting Out Root: User namespaces in DockerRooting Out Root: User namespaces in Docker
Rooting Out Root: User namespaces in DockerPhil Estes
 
Taking Control of Chaos with Docker and Puppet
Taking Control of Chaos with Docker and PuppetTaking Control of Chaos with Docker and Puppet
Taking Control of Chaos with Docker and PuppetPuppet
 
Docker puppetcamp london 2013
Docker puppetcamp london 2013Docker puppetcamp london 2013
Docker puppetcamp london 2013Tomas Doran
 
Docker4Drupal 2.1 for Development
Docker4Drupal 2.1 for DevelopmentDocker4Drupal 2.1 for Development
Docker4Drupal 2.1 for DevelopmentWebsolutions Agency
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Zabbix
 
Introduction to Docker and Monitoring with InfluxData
Introduction to Docker and Monitoring with InfluxDataIntroduction to Docker and Monitoring with InfluxData
Introduction to Docker and Monitoring with InfluxDataInfluxData
 
The whale, the container, and the ocean
The whale, the container, and the oceanThe whale, the container, and the ocean
The whale, the container, and the oceanNick Palenchar
 
Balena: a Moby-based container engine for IoT
Balena: a Moby-based container engine for IoT Balena: a Moby-based container engine for IoT
Balena: a Moby-based container engine for IoT Balena
 
State of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDataState of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDatainside-BigData.com
 
The internals and the latest trends of container runtimes
The internals and the latest trends of container runtimesThe internals and the latest trends of container runtimes
The internals and the latest trends of container runtimesAkihiro Suda
 
Inside Docker for Fedora20/RHEL7
Inside Docker for Fedora20/RHEL7Inside Docker for Fedora20/RHEL7
Inside Docker for Fedora20/RHEL7Etsuji Nakai
 
Real World Experience of Running Docker in Development and Production
Real World Experience of Running Docker in Development and ProductionReal World Experience of Running Docker in Development and Production
Real World Experience of Running Docker in Development and ProductionBen Hall
 
Take care of hundred containers and not go crazy
Take care of hundred containers and not go crazyTake care of hundred containers and not go crazy
Take care of hundred containers and not go crazyHonza Horák
 
Be a Happier Developer with Docker: Tricks of the Trade
Be a Happier Developer with Docker: Tricks of the TradeBe a Happier Developer with Docker: Tricks of the Trade
Be a Happier Developer with Docker: Tricks of the TradeDocker, Inc.
 
Orchestrating Docker with OpenStack
Orchestrating Docker with OpenStackOrchestrating Docker with OpenStack
Orchestrating Docker with OpenStackErica Windisch
 
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...NETWAYS
 

Similar to Valhalla at Pantheon (20)

Troubleshooting containerized triple o deployment
Troubleshooting containerized triple o deploymentTroubleshooting containerized triple o deployment
Troubleshooting containerized triple o deployment
 
ContainerDayVietnam2016: Django Development with Docker
ContainerDayVietnam2016: Django Development with DockerContainerDayVietnam2016: Django Development with Docker
ContainerDayVietnam2016: Django Development with Docker
 
Rooting Out Root: User namespaces in Docker
Rooting Out Root: User namespaces in DockerRooting Out Root: User namespaces in Docker
Rooting Out Root: User namespaces in Docker
 
Taking Control of Chaos with Docker and Puppet
Taking Control of Chaos with Docker and PuppetTaking Control of Chaos with Docker and Puppet
Taking Control of Chaos with Docker and Puppet
 
Docker puppetcamp london 2013
Docker puppetcamp london 2013Docker puppetcamp london 2013
Docker puppetcamp london 2013
 
dh-make-perl
dh-make-perldh-make-perl
dh-make-perl
 
Docker4Drupal 2.1 for Development
Docker4Drupal 2.1 for DevelopmentDocker4Drupal 2.1 for Development
Docker4Drupal 2.1 for Development
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
 
Introduction to Docker and Monitoring with InfluxData
Introduction to Docker and Monitoring with InfluxDataIntroduction to Docker and Monitoring with InfluxData
Introduction to Docker and Monitoring with InfluxData
 
The whale, the container, and the ocean
The whale, the container, and the oceanThe whale, the container, and the ocean
The whale, the container, and the ocean
 
Balena: a Moby-based container engine for IoT
Balena: a Moby-based container engine for IoT Balena: a Moby-based container engine for IoT
Balena: a Moby-based container engine for IoT
 
State of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDataState of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigData
 
The internals and the latest trends of container runtimes
The internals and the latest trends of container runtimesThe internals and the latest trends of container runtimes
The internals and the latest trends of container runtimes
 
Inside Docker for Fedora20/RHEL7
Inside Docker for Fedora20/RHEL7Inside Docker for Fedora20/RHEL7
Inside Docker for Fedora20/RHEL7
 
Real World Experience of Running Docker in Development and Production
Real World Experience of Running Docker in Development and ProductionReal World Experience of Running Docker in Development and Production
Real World Experience of Running Docker in Development and Production
 
Take care of hundred containers and not go crazy
Take care of hundred containers and not go crazyTake care of hundred containers and not go crazy
Take care of hundred containers and not go crazy
 
Demo 0.9.4
Demo 0.9.4Demo 0.9.4
Demo 0.9.4
 
Be a Happier Developer with Docker: Tricks of the Trade
Be a Happier Developer with Docker: Tricks of the TradeBe a Happier Developer with Docker: Tricks of the Trade
Be a Happier Developer with Docker: Tricks of the Trade
 
Orchestrating Docker with OpenStack
Orchestrating Docker with OpenStackOrchestrating Docker with OpenStack
Orchestrating Docker with OpenStack
 
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
 

More from David Timothy Strauss

More from David Timothy Strauss (13)

Advanced Drupal 8 Caching
Advanced Drupal 8 CachingAdvanced Drupal 8 Caching
Advanced Drupal 8 Caching
 
LCache DrupalCon Dublin 2016
LCache DrupalCon Dublin 2016LCache DrupalCon Dublin 2016
LCache DrupalCon Dublin 2016
 
Container Security via Monitoring and Orchestration - Container Security Summit
Container Security via Monitoring and Orchestration - Container Security SummitContainer Security via Monitoring and Orchestration - Container Security Summit
Container Security via Monitoring and Orchestration - Container Security Summit
 
Don't Build "Death Star" Security - O'Reilly Software Architecture Conference...
Don't Build "Death Star" Security - O'Reilly Software Architecture Conference...Don't Build "Death Star" Security - O'Reilly Software Architecture Conference...
Don't Build "Death Star" Security - O'Reilly Software Architecture Conference...
 
Effective service and resource management with systemd
Effective service and resource management with systemdEffective service and resource management with systemd
Effective service and resource management with systemd
 
Containers > VMs
Containers > VMsContainers > VMs
Containers > VMs
 
PHP at Density and Scale (Lone Star PHP 2014)
PHP at Density and Scale (Lone Star PHP 2014)PHP at Density and Scale (Lone Star PHP 2014)
PHP at Density and Scale (Lone Star PHP 2014)
 
PHP at Density and Scale
PHP at Density and ScalePHP at Density and Scale
PHP at Density and Scale
 
PHP at Density and Scale
PHP at Density and ScalePHP at Density and Scale
PHP at Density and Scale
 
Scalable Drupal Infrastructure
Scalable Drupal InfrastructureScalable Drupal Infrastructure
Scalable Drupal Infrastructure
 
Planning LAMP infrastructure
Planning LAMP infrastructurePlanning LAMP infrastructure
Planning LAMP infrastructure
 
Is Drupal Secure?
Is Drupal Secure?Is Drupal Secure?
Is Drupal Secure?
 
Cassandra queuing
Cassandra queuingCassandra queuing
Cassandra queuing
 

Recently uploaded

A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 

Recently uploaded (20)

A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 

Valhalla at Pantheon

  • 1. Valhalla at Pantheon A Distributed File System Built on Cassandra, Twisted Python, and FUSE
  • 2. Pantheon's Requirements ● Density ○ Over 50K volumes in a single cluster ○ Over 1000 clients on a single application server ● Storage volume ○ Over 10TB in a single cluster ○ De-duplication of redundant data ● Throughput ○ Peaks during the U.S. business day and during site imports and backups ● Performance ○ Back-end for Drupal web applications; access has to be fast enough to not burden a web request ○ The applications won't be adapted from running on local disk to running on Valhalla
  • 3. Why not off-the-shelf? ● NFS ○ UID mapping requires trusted clients and networks ○ Standard Kerberos implementations have no HA ○ No cloud HA for client/server communication ● GlusterFS ○ Cannot scale volume density (though HekaFS can) ○ Cannot de-duplicate data ● Ceph ○ Security model relies on trusted clients ● MooseFS ○ Only primitive security
  • 4. Valhalla's Design Manifesto ● Drupal applications read and write whole files between 10KB and 10MB ○ And most reads hit the edge proxy cache ● Drupal tracks files in its database and has little need for fstat() or directory listings ● POSIX compliance for locking and permissions is unimportant ○ But volume-level access control is critical ● Volumes may contain up to 1MM files ● Availability and performance trump consistency
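  • 5. Valhalla 1.0 (schema diagram) ● volumes: rows keyed by volume, columns keyed by path; file columns hold a content hash ○ vol1: /d1/, /d1/f1.txt → ade12..., /d1/d3/, /d1/d3/f2.txt → c12bea..., ... ○ vol2: /dir1/, /dir1/file.txt → ade12..., /dir1/f2.txt → c12bea..., ... ○ vol3: /dir3/, /dir3/f3.txt → 13a8cd..., /dir3/f2.txt → c12bea..., ... ● content_by_file: rows keyed by content hash (ade12..., c12bea..., 13a8cd..., ...), each with a content column holding the binary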
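The Valhalla 1.0 layout above is content-addressed: each volume row maps paths to content hashes, and identical content is stored once. The following is a minimal in-memory sketch of that idea, not Pantheon's implementation. Plain dicts stand in for the two Cassandra column families, SHA-256 is an assumed hash (the slides do not name one), and `put_file`, `read_file`, and `clone_volume` are hypothetical helper names.

```python
import hashlib

# Plain dicts stand in for the two Cassandra column families (assumption).
volumes = {}          # volume name -> {path: content hash}
content_by_file = {}  # content hash -> file bytes


def put_file(volume, path, data):
    """Store data once per unique hash and point the path at it."""
    digest = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
    content_by_file.setdefault(digest, data)
    volumes.setdefault(volume, {})[path] = digest
    return digest


def read_file(volume, path):
    """Follow the path's hash into the shared content store."""
    return content_by_file[volumes[volume][path]]


def clone_volume(source, target):
    """Cloning copies only the path-to-hash row, never the content."""
    volumes[target] = dict(volumes[source])


put_file("vol1", "/d1/f1.txt", b"hello")
put_file("vol2", "/dir1/file.txt", b"hello")  # same bytes, same hash
clone_volume("vol1", "vol1-clone")
```

Because a clone duplicates only the small path-to-hash mapping, this layout is consistent with the "efficient volume cloning" the 1.0 retrospective reports.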
  • 6. Valhalla 1.0 Retrospective ● What worked ○ Efficient volume cloning ● What didn't ○ Slow computation of directory content when a directory is small but contains a large subdirectory ■ Fix: Depth prefix for entries ○ Slow computation of file size ■ Fix: Denormalize metadata into directory entries ○ Problems replicating large files ■ Fix: Split files into chunks
  • 7. Valhalla 2.0 (schema diagram) ● volumes: column names gain a depth prefix; file columns hold denormalized metadata ○ vol1: 1:/d1/, 1:/d1/f1.txt → {"size": 1243, "hash": ade12...}, 1:/d1/d3/, 2:/d1/d3/f2.txt → {"size": 111, "hash": c12bea...}, ... ○ vol2: 1:/dir1/, 1:/dir1/file.txt → {"size": 1243, "hash": ade12...}, 1:/dir1/f2.txt → {"size": 111, "hash": c12bea...}, ... ○ vol3: 1:/dir3/, 1:/dir3/f3.txt → {"size": 5243, "hash": 13a8cd...}, 1:/dir3/f2.txt → {"size": 111, "hash": c12bea...}, ... ● content_by_file: rows keyed by content hash, columns keyed by chunk index, values binary ○ ade12...: chunks 0, 1 ○ c12bea...: chunk 0 ○ 13a8cd...: chunks 0, 1, 2
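Two of the 2.0 changes, chunked content and metadata denormalized into the directory entry, can be sketched in a few lines. This is an illustration under assumptions: the chunk size is deliberately tiny (the slides do not give the real one), SHA-256 is an assumed hash, and `split_chunks`, `join_chunks`, and `make_entry` are hypothetical names.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; the real chunk size is not given


def split_chunks(data, chunk_size=CHUNK_SIZE):
    """Return {chunk index: bytes}, mirroring the 0, 1, 2 columns."""
    return {i // chunk_size: data[i:i + chunk_size]
            for i in range(0, len(data), chunk_size)}


def join_chunks(chunks):
    """Reassemble a file by concatenating chunks in index order."""
    return b"".join(chunks[i] for i in sorted(chunks))


def make_entry(data):
    """Directory entry carrying size and hash, so a size lookup
    does not have to read the file content."""
    return {"size": len(data), "hash": hashlib.sha256(data).hexdigest()}


data = b"0123456789"
chunks = split_chunks(data)
entry = make_entry(data)
```

With the size stored next to the hash, reading a file's size touches only the volume row, which is the slow path the 1.0 retrospective set out to fix.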
  • 8. Valhalla 2.0 Retrospective ● What worked ○ Version 1.0 issues fixed ● Problems to solve ○ Directory listings iterate over many columns ■ Fix: Cache complete PROPFIND responses ○ Single-threaded client bottlenecks ■ Fix: "Fast path" with direct HTTP from PHP and proxied by Nginx ○ File content compaction eats up too much disk ■ Fix: "Offloading" cold and large content to S3 using iterative scripts and real-time decisions
  • 9. Valhalla 3.0 (schema diagram) ● listing_cache (new): rows keyed by volume, columns keyed by directory, values binary ○ vol1: /dir1/, /dir2/, ... ○ vol2: /dir1/, ... ○ vol3: /d1/, /d1/d2/, /d3/, ... ● Unchanged: content_by_file, volumes
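The listing cache stores one complete, pre-rendered response per directory, and a change to the directory invalidates it. A minimal sketch of that behavior, assuming dicts in place of Cassandra rows and a newline-joined name list in place of a real PROPFIND body; `render_listing`, `get_listing`, and `add_file` are hypothetical names.

```python
volumes = {"vol1": {"/dir1/": ["a.txt", "b.txt"]}}  # volume -> dir -> names
listing_cache = {}  # volume -> {directory: cached response bytes}
renders = []        # records each expensive render, for illustration


def render_listing(volume, directory):
    """Stand-in for iterating columns and building a PROPFIND body."""
    renders.append((volume, directory))
    return "\n".join(sorted(volumes[volume][directory])).encode()


def get_listing(volume, directory):
    """Serve the cached response, rendering it only on a miss."""
    row = listing_cache.setdefault(volume, {})
    if directory not in row:
        row[directory] = render_listing(volume, directory)
    return row[directory]


def add_file(volume, directory, name):
    """A change drops the cached listing so the next read rebuilds it."""
    volumes[volume][directory].append(name)
    listing_cache.get(volume, {}).pop(directory, None)


first = get_listing("vol1", "/dir1/")
second = get_listing("vol1", "/dir1/")   # served from the cache
add_file("vol1", "/dir1/", "c.txt")
third = get_listing("vol1", "/dir1/")    # rebuilt after invalidation
```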
  • 10. Valhalla 3.0 Retrospective ● What worked ○ Version 2.0 issues fixed ● Problems to solve ○ Changes invalidate cached PROPFINDs, and then clients do a PROPFIND ■ Fix: Extend schema and API to support volume and directory event propagation ○ Single-threaded client still bottlenecks ■ Fix: New, multithreaded client ○ Client uses a write-invalidate cache ■ Fix: Move to a write-through/write-back model
  • 11. Meanwhile, in backups ● Stopped using davfs2 file mounts ● New backup preparation algorithm a. Backup builder downloads volume manifest b. Iterates through each file and goes directly from S3 to the tarball c. Any files not yet on S3 get pushed there by requesting an "offload" ● Lower client overhead ● Lower server overhead ● Longer backup preparation time
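The backup preparation steps above can be sketched with the standard library. Everything storage-related here is an assumption for illustration: dicts stand in for S3 and for content still held in Cassandra, the manifest is assumed to map paths to content hashes, and `offload` and `build_backup` are hypothetical names.

```python
import io
import tarfile

s3 = {"ade12": b"hello"}          # hash -> bytes already on S3 (stand-in)
cassandra = {"c12bea": b"world"}  # content not yet offloaded (stand-in)
manifest = {"/d1/f1.txt": "ade12", "/d1/d3/f2.txt": "c12bea"}


def offload(digest):
    """Stand-in for requesting an offload: copy the content to S3."""
    s3[digest] = cassandra[digest]


def build_backup(manifest):
    """Stream each file in the manifest from S3 into a tarball."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path, digest in sorted(manifest.items()):
            if digest not in s3:
                offload(digest)  # push missing content to S3 first
            data = s3[digest]
            info = tarfile.TarInfo(name=path.lstrip("/"))
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


archive = build_backup(manifest)
```

The builder never mounts the volume, which matches the lower client and server overhead the slide reports; waiting on offloads is one plausible source of the longer preparation time.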
  • 12. Valhalla 4.0 (schema diagram) ● events (new): rows keyed by volume:directory, columns keyed by time ○ vol1:/dir1/: t=1 → {"path": "/dir2/", "event": "CREATED"...}, t=2 → {"path": "/dir2/f2.txt", "event": "CREATED"...}, ... ○ vol1:/dir2/: t=5 → {"path": "/dir5/", "event": "CREATED"...}, t=6 → {"path": "/dir6/", "event": "CREATED"...}, ... ○ vol3:/d1/d2/: t=5 → {"path": "f3.txt", "event": "CREATED"...}, t=6 → {"path": "f3.txt", "event": "DESTROYED"...}, ... ● Unchanged: content_by_file, volumes, listing_cache
  • 13. Valhalla 4.0 Retrospective ● What worked ○ Version 3.0 issues fixed ● Problems to solve ○ Cloning volumes breaks the event stream ■ Fix: Invalidate events from before the volume clone request ○ Clients receiving earlier copies of their own events ■ Fix: Only send clients events published by other clients ○ Clients write a file and then have to re-download it because of ETag limitations ■ Fix: Extend PUT to send ETag on response ○ Iteration through file content items times out ■ Fix: Iterate through local sstable keys
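One of the fixes above, sending clients only the events that other clients published, amounts to a filter on the event stream. A minimal sketch, assuming each event records the publishing client's ID (the slides do not show such a field) and using a counter in place of real timestamps; `publish` and `poll` are hypothetical names.

```python
import itertools

clock = itertools.count(1)  # stand-in for event timestamps
events = {}                 # "volume:directory" -> list of event dicts


def publish(volume, directory, path, event, client_id):
    """Append an event to the row for this volume and directory."""
    events.setdefault(f"{volume}:{directory}", []).append(
        {"t": next(clock), "path": path, "event": event,
         "client": client_id})  # the client field is an assumption


def poll(volume, directory, client_id, since=0):
    """Return events newer than `since` that other clients published."""
    return [e for e in events.get(f"{volume}:{directory}", [])
            if e["t"] > since and e["client"] != client_id]


publish("vol3", "/d1/d2/", "f3.txt", "CREATED", client_id="a")
publish("vol3", "/d1/d2/", "f3.txt", "DESTROYED", client_id="b")
seen_by_a = poll("vol3", "/d1/d2/", client_id="a")
```

Here client "a" sees only the DESTROYED event from client "b" and is not told again about the file it created itself.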
  • 14. Valhalla 4.5 (schema diagram) ● volume_metadata (new): rows keyed by volume ○ vol1: rewritten → t=3, ... ○ vol2: ... ○ vol3: rewritten → t=2, ... ● Unchanged: content_by_file, volumes, listing_cache, events
  • 15. Implementing the Client Side ● Ditched davfs2 ○ Single-threaded with only experimental patches to multi-thread ○ Crufty code base designed to abstract FUSE versus Coda ● Based code off of fusedav ○ Already multithreaded ○ Uses proven Neon WebDAV client ● Gutted cache ○ Needed fine-grained update capability for write-through and write-back ○ Replaced with LevelDB ● Added in high-level FUSE operations ○ Atomic open+truncate, atomic create+open, etc.
  • 16. Caching model ● LevelDB ○ Embeddable with low overhead ○ Iteration without allocation management ○ Data model identical to single Cassandra row ○ Storage model similar to Cassandra sstables ○ Similar atomicity to row changes in Cassandra 1.1+ ● Mirrored volume row locally ○ Including prefixes and metadata ○ May move to Merkle-based replication later
  • 17. Benchmarks versus Local and Older Models
  • 18. Benchmarks versus Local and Older Models
  • 19. What's Next at Pantheon ● Move more toward a pNFS model ○ No file content storage in Cassandra (all in S3) ○ Peer-to-peer or other non-Cassandra file content coordination between clients ● Peer-to-peer cache advisories between clients ○ Less chatty server communication to poll events ○ Smaller window of incoherence (3s to <1s) ● Dropping the "fast path" ○ Client is already multithreaded ○ Client cache is smarter than direct Valhalla access ○ Minimizes incompatibility with Drupal
  • 20. What's Next for the Community ● Finalize GPL-licensed FuseDAV client ○ Already public on GitHub ○ Public test suite with bundled server ○ Coordinate with existing FuseDAV users to make the Pantheon version the official successor ● Publish WebDAV extensions and seek standards acceptance ○ Progressive PROPFIND ○ ETag on PUT
  • 21. David Strauss ● My groups ○ Drupal Association ○ Pantheon Systems ○ systemd/udev ● Get in touch ○ david@davidstrauss.net ○ @davidstrauss ○ facebook.com/warpforge ● Learn more about Pantheon ○ Developer Open House ○ Presented by Kyle Mathews and Josh Koenig ○ Thursday, February 14th, 12PM PST ○ Sign up: http://tinyurl.com/a3ofpc2