SlideShare a Scribd company logo
1 of 21
Valhalla at Pantheon
     A Distributed File System Built
on Cassandra, Twisted Python, and FUSE
Pantheon's Requirements
● Density
  ○ Over 50K volumes in a single cluster
  ○ Over 1000 clients on a single application server
● Storage volume
  ○ Over 10TB in a single cluster
  ○ De-duplication of redundant data
● Throughput
  ○ Peaks during the U.S. business day and during site
    imports and backups
● Performance
  ○ Back-end for Drupal web applications; access
    has to be fast enough to not burden a web request
  ○ The applications won't be adapted from running on
    local disk to running on Valhalla
Why not off-the-shelf?
● NFS
  ○ UID mapping requires trusted clients and networks
  ○ Standard Kerberos implementations have no HA
  ○ No cloud HA for client/server communication
● GlusterFS
  ○ Cannot scale volume density (though HekaFS can)
  ○ Cannot de-duplicate data
● Ceph
  ○ Security model relies on trusted clients
● MooseFS
  ○ Only primitive security
Valhalla's Design Manifesto
● Drupal applications read and write whole
  files between 10KB and 10MB
   ○ And most reads hit the edge proxy cache
● Drupal tracks files in its database and has
  little need for fstat() or directory listings
● POSIX compliance for locking and
  permissions is unimportant
   ○ But volume-level access control is critical
● Volumes may contain up to 1MM files
● Availability and performance trump
  consistency
volumes                                               content_by_file


       /d1/      /d1/f1.txt         /d1/d3/    /d1/d3/f2.txt                 content
vol1                                                       ...   ade12...
                 ade12...                      c12bea...                     binary



        /dir1/     /dir1/file.txt       /dir1/f2.txt                         content
vol2                                                       ...   c12bea...
                   ade12...             c12bea...                            binary



        /dir3/     /dir3/f3.txt         /dir3/f2.txt                         content
vol3                                                       ...   13a8cd...
                   13a8cd...            c12bea...                            binary

                              ...                                                      ...




                                                       Valhalla 1.0
Valhalla 1.0 Retrospective
● What worked
  ○ Efficient volume cloning
● What didn't
  ○ Slow computation of directory content when a
    directory is small but contains a large subdirectory
    ■ Fix: Depth prefix for entries
  ○ Slow computation of file size
    ■ Fix: Denormalize metadata into directory entries
  ○ Problems replicating large files
    ■ Fix: Split files into chunks
volumes                                                            content_by_file


       1:/d1/      1:/d1/f1.txt              1:/d1/d3/     2:/d1/d3/f2.txt                        0           1
vol1               {"size": 1243,                          {"size": 111,
                                                                                ...   ade12...
                    "hash": ade12...                        "hash": c12bea...
                                                                                                  binary      binary



        1:/dir1/      1:/dir1/file.txt           1:/dir1/f2.txt                                   0
vol2                                                                            ...   c12bea...
                      {"size": 1243,             {"size": 111,                                    binary
                      "hash": ade12...            "hash": c12bea...




        1:/dir3/        1:/dir3/f3.txt            1:/dir3/f2.txt                                  0           1        2
vol3                                                                            ...   13a8cd...
                        {"size": 5243,            {"size": 111,                                   binary      binary   binary
                        "hash": 13a8cd...          "hash": c12bea...


                                       ...                                                              ...




                                                             Valhalla 2.0
Valhalla 2.0 Retrospective
● What worked
  ○ Version 1.0 issues fixed
● Problems to solve
  ○ Directory listings iterate over many columns
    ■ Fix: Cache complete PROPFIND responses
  ○ Single-threaded client bottlenecks
    ■ Fix: "Fast path" with direct HTTP from PHP and
        proxied by Nginx
  ○ File content compaction eats up too much disk
    ■ Fix: "Offloading" cold and large content to S3
        using iterative scripts and real-time decisions
listing_cache                               Unchanged

                                                  content_by_file
       /dir1/         /dir2/
vol1
       binary         binary
                                                        ...


       /dir1/                                        volumes
vol2
       binary
                                                        ...

       /d1/           /d1/d2/   /d3/
vol3
       binary         binary    binary

                ...




                                   Valhalla 3.0
Valhalla 3.0 Retrospective
● What worked
  ○ Version 2.0 issues fixed
● Problems to solve
  ○ Changes invalidate cached PROPFINDs, and then
    clients do a PROPFIND
    ■ Fix: Extend schema and API to support volume
        and directory event propagation
  ○ Single-threaded client still bottlenecks
    ■ Fix: New, multithreaded client
  ○ Client uses a write-invalidate cache
    ■ Fix: Move to a write-through/write-back model
Meanwhile, in backups
● Stopped using davfs2 file mounts
● New backup preparation algorithm
  a. Backup builder downloads volume manifest
  b. Iterates through each file and goes directly from S3
     to the tarball
  c. Any files not yet on S3 get pushed there by
     requesting an "offload"
● Lower client overhead
● Lower server overhead
● Longer backup preparation time
events                                       Unchanged

                                                                                       content_by_file
               t=1                                  t=2
vol1:/dir1/
               {"path": "/dir2/","event":           {"path": "/dir2/f2.txt","event":
               "CREATED"...                         "CREATED"...
                                                                                             ...


               t=5                                  t=6                                   volumes
vol1:/dir2/
               {"path": "/dir5/","event":           {"path": "/dir6/","event":
               "CREATED"...                         "CREATED"...
                                                                                             ...

               t=5                                  t=6
                                                                                        listing_cache
vol3:/d1/d2/
               {"path": "f3.txt","event":           {"path": "f3.txt","event":
               "CREATED"...                         "DESTROYED"...


                                              ...                                            ...




                                                      Valhalla 4.0
Valhalla 4.0 Retrospective
● What worked
  ○ Version 3.0 issues fixed
● Problems to solve
  ○ Cloning volumes breaks the event stream
     ■ Fix: Invalidate events from before the volume
        clone request
  ○ Clients receiving earlier copies of their own events
     ■ Fix: Only send clients events published by other
        clients
  ○ Clients write a file and then have to re-download it
    because of ETag limitations
     ■ Fix: Extend PUT to send ETag on response
  ○ Iteration through file content items times out
     ■ Fix: Iterate through local sstable keys
volume_metadata              Unchanged

                                              content_by_file
       rewritten
vol1
       t=3
                                                    ...

                                                 volumes
vol2

                                                    ...

       rewritten
                                               listing_cache
vol3
       t=2

                         ...                        ...

                                                  events



                                                    ...
                               Valhalla 4.5
Implementing the Client Side
● Ditched davfs2
  ○ Single-threaded with only experimental patches to
    multi-thread
  ○ Crufty code base designed to abstract FUSE versus
    Coda
● Based code off of fusedav
  ○ Already multithreaded
  ○ Uses proven Neon WebDAV client
● Gutted cache
  ○ Needed fine-grained update capability for write-
    through and write-back
  ○ Replaced with LevelDB
● Added in high-level FUSE operations
  ○ Atomic open+truncate, atomic create+open, etc.
Caching model
● LevelDB
  ○   Embeddable with low overhead
  ○   Iteratation without allocation management
  ○   Data model identical to single Cassandra row
  ○   Storage model similar to Cassandra sstables
  ○   Similar atomicity to row changes in Cassandra 1.1+
● Mirrored volume row locally
  ○ Including prefixes and metadata
  ○ May move to Merkel-based replication later
Benchmarks versus Local and Older Models
Benchmarks versus Local and Older Models
What's Next at Pantheon
● Move more toward a pNFS model
  ○ No file content storage in Cassandra (all in S3)
  ○ Peer-to-peer or other non-Cassandra file content
    coordination between clients
● Peer-to-peer cache advisories between
  clients
  ○ Less chatty server communication to poll events
  ○ Smaller window of incoherence (3s to <1s)
● Dropping the "fast path"
  ○ Client is already multithreaded
  ○ Client cache is smarter than direct Valhalla access
  ○ Minimizes incompatibility with Drupal
What's Next for the Community
● Finalize GPL-licensed FuseDAV client
  ○ Already public on GitHub
  ○ Public test suite with bundled server
  ○ Coordinate with existing FuseDAV users to make the
    Pantheon version the official successor
● Publish WebDAV extensions and seek
  standards acceptance
  ○ Progressive PROPFIND
  ○ ETag on PUT
David Strauss
● My groups
  ○ Drupal Association
  ○ Pantheon Systems
  ○ systemd/udev
● Get in touch
  ○ david@davidstrauss.net
  ○ @davidstrauss
  ○ facebook.com/warpforge
● Learn more about Pantheon
  ○   Developer Open House
  ○   Presented by Kyle Mathews and Josh Koenig
  ○   Thursday, February 14th, 12PM PST
  ○   Sign up: http://tinyurl.com/a3ofpc2

More Related Content

Similar to Valhalla at Pantheon

Troubleshooting containerized triple o deployment
Troubleshooting containerized triple o deploymentTroubleshooting containerized triple o deployment
Troubleshooting containerized triple o deploymentSadique Puthen
 
ContainerDayVietnam2016: Django Development with Docker
ContainerDayVietnam2016: Django Development with DockerContainerDayVietnam2016: Django Development with Docker
ContainerDayVietnam2016: Django Development with DockerDocker-Hanoi
 
Rooting Out Root: User namespaces in Docker
Rooting Out Root: User namespaces in DockerRooting Out Root: User namespaces in Docker
Rooting Out Root: User namespaces in DockerPhil Estes
 
Taking Control of Chaos with Docker and Puppet
Taking Control of Chaos with Docker and PuppetTaking Control of Chaos with Docker and Puppet
Taking Control of Chaos with Docker and PuppetPuppet
 
Docker puppetcamp london 2013
Docker puppetcamp london 2013Docker puppetcamp london 2013
Docker puppetcamp london 2013Tomas Doran
 
Docker4Drupal 2.1 for Development
Docker4Drupal 2.1 for DevelopmentDocker4Drupal 2.1 for Development
Docker4Drupal 2.1 for DevelopmentWebsolutions Agency
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Zabbix
 
Introduction to Docker and Monitoring with InfluxData
Introduction to Docker and Monitoring with InfluxDataIntroduction to Docker and Monitoring with InfluxData
Introduction to Docker and Monitoring with InfluxDataInfluxData
 
The whale, the container, and the ocean
The whale, the container, and the oceanThe whale, the container, and the ocean
The whale, the container, and the oceanNick Palenchar
 
Balena: a Moby-based container engine for IoT
Balena: a Moby-based container engine for IoT Balena: a Moby-based container engine for IoT
Balena: a Moby-based container engine for IoT Balena
 
State of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDataState of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDatainside-BigData.com
 
The internals and the latest trends of container runtimes
The internals and the latest trends of container runtimesThe internals and the latest trends of container runtimes
The internals and the latest trends of container runtimesAkihiro Suda
 
Inside Docker for Fedora20/RHEL7
Inside Docker for Fedora20/RHEL7Inside Docker for Fedora20/RHEL7
Inside Docker for Fedora20/RHEL7Etsuji Nakai
 
Real World Experience of Running Docker in Development and Production
Real World Experience of Running Docker in Development and ProductionReal World Experience of Running Docker in Development and Production
Real World Experience of Running Docker in Development and ProductionBen Hall
 
Take care of hundred containers and not go crazy
Take care of hundred containers and not go crazyTake care of hundred containers and not go crazy
Take care of hundred containers and not go crazyHonza Horák
 
Be a Happier Developer with Docker: Tricks of the Trade
Be a Happier Developer with Docker: Tricks of the TradeBe a Happier Developer with Docker: Tricks of the Trade
Be a Happier Developer with Docker: Tricks of the TradeDocker, Inc.
 
Orchestrating Docker with OpenStack
Orchestrating Docker with OpenStackOrchestrating Docker with OpenStack
Orchestrating Docker with OpenStackErica Windisch
 
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...NETWAYS
 

Similar to Valhalla at Pantheon (20)

Troubleshooting containerized triple o deployment
Troubleshooting containerized triple o deploymentTroubleshooting containerized triple o deployment
Troubleshooting containerized triple o deployment
 
ContainerDayVietnam2016: Django Development with Docker
ContainerDayVietnam2016: Django Development with DockerContainerDayVietnam2016: Django Development with Docker
ContainerDayVietnam2016: Django Development with Docker
 
Rooting Out Root: User namespaces in Docker
Rooting Out Root: User namespaces in DockerRooting Out Root: User namespaces in Docker
Rooting Out Root: User namespaces in Docker
 
Taking Control of Chaos with Docker and Puppet
Taking Control of Chaos with Docker and PuppetTaking Control of Chaos with Docker and Puppet
Taking Control of Chaos with Docker and Puppet
 
Docker puppetcamp london 2013
Docker puppetcamp london 2013Docker puppetcamp london 2013
Docker puppetcamp london 2013
 
dh-make-perl
dh-make-perldh-make-perl
dh-make-perl
 
Docker4Drupal 2.1 for Development
Docker4Drupal 2.1 for DevelopmentDocker4Drupal 2.1 for Development
Docker4Drupal 2.1 for Development
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
 
Introduction to Docker and Monitoring with InfluxData
Introduction to Docker and Monitoring with InfluxDataIntroduction to Docker and Monitoring with InfluxData
Introduction to Docker and Monitoring with InfluxData
 
The whale, the container, and the ocean
The whale, the container, and the oceanThe whale, the container, and the ocean
The whale, the container, and the ocean
 
Balena: a Moby-based container engine for IoT
Balena: a Moby-based container engine for IoT Balena: a Moby-based container engine for IoT
Balena: a Moby-based container engine for IoT
 
State of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDataState of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigData
 
The internals and the latest trends of container runtimes
The internals and the latest trends of container runtimesThe internals and the latest trends of container runtimes
The internals and the latest trends of container runtimes
 
Inside Docker for Fedora20/RHEL7
Inside Docker for Fedora20/RHEL7Inside Docker for Fedora20/RHEL7
Inside Docker for Fedora20/RHEL7
 
Real World Experience of Running Docker in Development and Production
Real World Experience of Running Docker in Development and ProductionReal World Experience of Running Docker in Development and Production
Real World Experience of Running Docker in Development and Production
 
Take care of hundred containers and not go crazy
Take care of hundred containers and not go crazyTake care of hundred containers and not go crazy
Take care of hundred containers and not go crazy
 
Demo 0.9.4
Demo 0.9.4Demo 0.9.4
Demo 0.9.4
 
Be a Happier Developer with Docker: Tricks of the Trade
Be a Happier Developer with Docker: Tricks of the TradeBe a Happier Developer with Docker: Tricks of the Trade
Be a Happier Developer with Docker: Tricks of the Trade
 
Orchestrating Docker with OpenStack
Orchestrating Docker with OpenStackOrchestrating Docker with OpenStack
Orchestrating Docker with OpenStack
 
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
OSDC 2015: Roland Kammerer | DRBD9: Managing High-Available Storage in Many-N...
 

More from David Timothy Strauss

More from David Timothy Strauss (13)

Advanced Drupal 8 Caching
Advanced Drupal 8 CachingAdvanced Drupal 8 Caching
Advanced Drupal 8 Caching
 
LCache DrupalCon Dublin 2016
LCache DrupalCon Dublin 2016LCache DrupalCon Dublin 2016
LCache DrupalCon Dublin 2016
 
Container Security via Monitoring and Orchestration - Container Security Summit
Container Security via Monitoring and Orchestration - Container Security SummitContainer Security via Monitoring and Orchestration - Container Security Summit
Container Security via Monitoring and Orchestration - Container Security Summit
 
Don't Build "Death Star" Security - O'Reilly Software Architecture Conference...
Don't Build "Death Star" Security - O'Reilly Software Architecture Conference...Don't Build "Death Star" Security - O'Reilly Software Architecture Conference...
Don't Build "Death Star" Security - O'Reilly Software Architecture Conference...
 
Effective service and resource management with systemd
Effective service and resource management with systemdEffective service and resource management with systemd
Effective service and resource management with systemd
 
Containers > VMs
Containers > VMsContainers > VMs
Containers > VMs
 
PHP at Density and Scale (Lone Star PHP 2014)
PHP at Density and Scale (Lone Star PHP 2014)PHP at Density and Scale (Lone Star PHP 2014)
PHP at Density and Scale (Lone Star PHP 2014)
 
PHP at Density and Scale
PHP at Density and ScalePHP at Density and Scale
PHP at Density and Scale
 
PHP at Density and Scale
PHP at Density and ScalePHP at Density and Scale
PHP at Density and Scale
 
Scalable Drupal Infrastructure
Scalable Drupal InfrastructureScalable Drupal Infrastructure
Scalable Drupal Infrastructure
 
Planning LAMP infrastructure
Planning LAMP infrastructurePlanning LAMP infrastructure
Planning LAMP infrastructure
 
Is Drupal Secure?
Is Drupal Secure?Is Drupal Secure?
Is Drupal Secure?
 
Cassandra queuing
Cassandra queuingCassandra queuing
Cassandra queuing
 

Recently uploaded

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Recently uploaded (20)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

Valhalla at Pantheon

  • 1. Valhalla at Pantheon A Distributed File System Built on Cassandra, Twisted Python, and FUSE
  • 2. Pantheon's Requirements ● Density ○ Over 50K volumes in a single cluster ○ Over 1000 clients on a single application server ● Storage volume ○ Over 10TB in a single cluster ○ De-duplication of redundant data ● Throughput ○ Peaks during the U.S. business day and during site imports and backups ● Performance ○ Back-end for Drupal web applications; access has to be fast enough to not burden a web request ○ The applications won't be adapted from running on local disk to running on Valhalla
  • 3. Why not off-the-shelf? ● NFS ○ UID mapping requires trusted clients and networks ○ Standard Kerberos implementations have no HA ○ No cloud HA for client/server communication ● GlusterFS ○ Cannot scale volume density (though HekaFS can) ○ Cannot de-duplicate data ● Ceph ○ Security model relies on trusted clients ● MooseFS ○ Only primitive security
  • 4. Valhalla's Design Manifesto ● Drupal applications read and write whole files between 10KB and 10MB ○ And most reads hit the edge proxy cache ● Drupal tracks files in its database and has little need for fstat() or directory listings ● POSIX compliance for locking and permissions is unimportant ○ But volume-level access control is critical ● Volumes may contain up to 1MM files ● Availability and performance trump consistency
  • 5. volumes content_by_file /d1/ /d1/f1.txt /d1/d3/ /d1/d3/f2.txt content vol1 ... ade12... ade12... c12bea... binary /dir1/ /dir1/file.txt /dir1/f2.txt content vol2 ... c12bea... ade12... c12bea... binary /dir3/ /dir3/f3.txt /dir3/f2.txt content vol3 ... 13a8cd... 13a8cd... c12bea... binary ... ... Valhalla 1.0
  • 6. Valhalla 1.0 Retrospective ● What worked ○ Efficient volume cloning ● What didn't ○ Slow computation of directory content when a directory is small but contains a large subdirectory ■ Fix: Depth prefix for entries ○ Slow computation of file size ■ Fix: Denormalize metadata into directory entries ○ Problems replicating large files ■ Fix: Split files into chunks
  • 7. volumes content_by_file 1:/d1/ 1:/d1/f1.txt 1:/d1/d3/ 2:/d1/d3/f2.txt 0 1 vol1 {"size": 1243, {"size": 111, ... ade12... "hash": ade12... "hash": c12bea... binary binary 1:/dir1/ 1:/dir1/file.txt 1:/dir1/f2.txt 0 vol2 ... c12bea... {"size": 1243, {"size": 111, binary "hash": ade12... "hash": c12bea... 1:/dir3/ 1:/dir3/f3.txt 1:/dir3/f2.txt 0 1 2 vol3 ... 13a8cd... {"size": 5243, {"size": 111, binary binary binary "hash": 13a8cd... "hash": c12bea... ... ... Valhalla 2.0
  • 8. Valhalla 2.0 Retrospective ● What worked ○ Version 1.0 issues fixed ● Problems to solve ○ Directory listings iterate over many columns ■ Fix: Cache complete PROPFIND responses ○ Single-threaded client bottlenecks ■ Fix: "Fast path" with direct HTTP from PHP and proxied by Nginx ○ File content compaction eats up too much disk ■ Fix: "Offloading" cold and large content to S3 using iterative scripts and real-time decisions
  • 9. listing_cache Unchanged content_by_file /dir1/ /dir2/ vol1 binary binary ... /dir1/ volumes vol2 binary ... /d1/ /d1/d2/ /d3/ vol3 binary binary binary ... Valhalla 3.0
  • 10. Valhalla 3.0 Retrospective ● What worked ○ Version 2.0 issues fixed ● Problems to solve ○ Changes invalidate cached PROPFINDs, and then clients do a PROPFIND ■ Fix: Extend schema and API to support volume and directory event propagation ○ Single-threaded client still bottlenecks ■ Fix: New, multithreaded client ○ Client uses a write-invalidate cache ■ Fix: Move to a write-through/write-back model
  • 11. Meanwhile, in backups ● Stopped using davfs2 file mounts ● New backup preparation algorithm a. Backup builder downloads volume manifest b. Iterates through each file and goes directly from S3 to the tarball c. Any files not yet on S3 get pushed there by requesting an "offload" ● Lower client overhead ● Lower server overhead ● Longer backup preparation time
  • 12. events Unchanged content_by_file t=1 t=2 vol1:/dir1/ {"path": "/dir2/","event": {"path": "/dir2/f2.txt","event": "CREATED"... "CREATED"... ... t=5 t=6 volumes vol1:/dir2/ {"path": "/dir5/","event": {"path": "/dir6/","event": "CREATED"... "CREATED"... ... t=5 t=6 listing_cache vol3:/d1/d2/ {"path": "f3.txt","event": {"path": "f3.txt","event": "CREATED"... "DESTROYED"... ... ... Valhalla 4.0
  • 13. Valhalla 4.0 Retrospective ● What worked ○ Version 3.0 issues fixed ● Problems to solve ○ Cloning volumes breaks the event stream ■ Fix: Invalidate events from before the volume clone request ○ Clients receiving earlier copies of their own events ■ Fix: Only send clients events published by other clients ○ Clients write a file and then have to re-download it because of ETag limitations ■ Fix: Extend PUT to send ETag on response ○ Iteration through file content items times out ■ Fix: Iterate through local sstable keys
  • 14. volume_metadata Unchanged content_by_file rewritten vol1 t=3 ... volumes vol2 ... rewritten listing_cache vol3 t=2 ... ... events ... Valhalla 4.5
  • 15. Implementing the Client Side ● Ditched davfs2 ○ Single-threaded with only experimental patches to multi-thread ○ Crufty code base designed to abstract FUSE versus Coda ● Based code off of fusedav ○ Already multithreaded ○ Uses proven Neon WebDAV client ● Gutted cache ○ Needed fine-grained update capability for write- through and write-back ○ Replaced with LevelDB ● Added in high-level FUSE operations ○ Atomic open+truncate, atomic create+open, etc.
  • 16. Caching model ● LevelDB ○ Embeddable with low overhead ○ Iteratation without allocation management ○ Data model identical to single Cassandra row ○ Storage model similar to Cassandra sstables ○ Similar atomicity to row changes in Cassandra 1.1+ ● Mirrored volume row locally ○ Including prefixes and metadata ○ May move to Merkel-based replication later
  • 17. Benchmarks versus Local and Older Models
  • 18. Benchmarks versus Local and Older Models
  • 19. What's Next at Pantheon ● Move more toward a pNFS model ○ No file content storage in Cassandra (all in S3) ○ Peer-to-peer or other non-Cassandra file content coordination between clients ● Peer-to-peer cache advisories between clients ○ Less chatty server communication to poll events ○ Smaller window of incoherence (3s to <1s) ● Dropping the "fast path" ○ Client is already multithreaded ○ Client cache is smarter than direct Valhalla access ○ Minimizes incompatibility with Drupal
  • 20. What's Next for the Community ● Finalize GPL-licensed FuseDAV client ○ Already public on GitHub ○ Public test suite with bundled server ○ Coordinate with existing FuseDAV users to make the Pantheon version the official successor ● Publish WebDAV extensions and seek standards acceptance ○ Progressive PROPFIND ○ ETag on PUT
  • 21. David Strauss ● My groups ○ Drupal Association ○ Pantheon Systems ○ systemd/udev ● Get in touch ○ david@davidstrauss.net ○ @davidstrauss ○ facebook.com/warpforge ● Learn more about Pantheon ○ Developer Open House ○ Presented by Kyle Mathews and Josh Koenig ○ Thursday, February 14th, 12PM PST ○ Sign up: http://tinyurl.com/a3ofpc2