Designing, Scoping, and Configuring
Scalable Drupal
Infrastructure


Presented 2009-05-30 by David Strauss
A guide to planning, deploying, and scaling big websites using Drupal.

For more Four Kitchens presentations, please visit http://fourkitchens.com/presentations

Scalable Drupal infrastructure

  1. Designing, Scoping, and Configuring Scalable Drupal Infrastructure
     Presented 2009-05-30 by David Strauss
  2. Understanding Load Distribution
  3. Predicting peak traffic
     Traffic over the day can be highly irregular. To plan for peak loads, design as if all traffic were as heavy as the peak hour of load in a typical month, and then plan for some growth.
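
The peak-hour rule above can be turned into a small calculation. This is a minimal sketch: the function name and the `growth_factor` default of 1.5 are assumptions for illustration, not figures from the talk.

```python
def design_request_rate(hourly_hits, growth_factor=1.5):
    """Return the requests per second to design for.

    hourly_hits: hit counts, one per hour, covering a typical month.
    growth_factor: headroom multiplier (an assumed value, not from the talk).
    """
    if not hourly_hits:
        raise ValueError("hourly_hits must not be empty")
    peak_hour_hits = max(hourly_hits)
    # Treat every hour as if it were the peak hour, then add headroom.
    return peak_hour_hits / 3600 * growth_factor
```

Feeding it a month of hourly counts from the access logs gives one number to carry into the sizing steps later in the deck.
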
  4. Analyzing hit distribution
     [Pie chart of hits by category. Labels: Human, Web Crawler, Static Content, Dynamic Pages, Anonymous, Authenticated, No Special Treatment, “Pay Wall” Bypass. The slide's percentages (3%, 7%, 10%, 20%, 30%, 40%, 50%, 70%, 100%) cannot be matched to their labels in this transcript.]
  5. Throughput vs. Delivery Methods
                                           Green        Yellow                  Red
                                           (Static)     (Dynamic, Cacheable)    (Dynamic)
     Content Delivery Network [2]          ●●●●●●●●●●   ✖                       ✖
     Reverse Proxy Cache (1000 req/s)      ●●●●●●●      ●●●●●●●                 ✖
     Drupal + Page Cache + memcached [1]   ●●●          ●●●                     ✖
     Drupal + Page Cache [1]               ●●●          ●●                      ✖
     Drupal [1] (10 req/s)                 ●●●          ●                       ●
     More dots = More throughput
     [1] Delivered by Apache without Drupal
     [2] Some actually can do this.
  6. Objective
     Deliver hits using the fastest, most scalable method available.
  7. Layering: Less Traffic at Each Step
     [Diagram components: Traffic, DNS Round Robin, CDN, and, inside “Your Datacenter”, Load Balancer, Reverse Proxy Cache, Application Server, Database.]
  8. Offload from the master database
     Your master database is the single greatest limitation on scalability.
     [Diagram components: Application Server, Search, Slave Database, Memory Cache, Master Database.]
  9. Tools to use
     ‣ Apache Solr for search. (Acquia offers hosting of this now.)
     ‣ Squid or Varnish for reverse proxy caching.
     ‣ Any third-party service for CDN.
  10. Do the math
      ‣ All non-CDN traffic travels through your load balancers and reverse proxy caches. Even traffic passed through to application servers must run through the initial layers.
      [Diagram: Traffic, Load Balancer, Reverse Proxy Cache, Application Server]
      What hit rate is each layer getting? How many servers share the load?
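
The two questions on this slide (hit rate per layer, servers per layer) can be worked through with a short script. This is a sketch under stated assumptions: the function name, the layer tuples, and the sample numbers are hypothetical, and a "hit rate" here means the fraction of requests a layer answers itself without passing them on.

```python
def per_server_load(incoming_rps, layers):
    """Return (layer name, requests/s per server) for each layer in order.

    layers: list of (name, hit_rate, server_count) tuples, in the order
    traffic flows. hit_rate is the fraction answered at that layer.
    """
    results = []
    remaining = incoming_rps
    for name, hit_rate, server_count in layers:
        # Every request that reaches a layer costs that layer something,
        # even if it is only passed through.
        results.append((name, remaining / server_count))
        # Only the misses continue to the next layer.
        remaining *= 1.0 - hit_rate
    return results


# Hypothetical cluster: 2000 non-CDN requests per second at peak.
example_layers = [
    ("load balancer", 0.0, 2),
    ("reverse proxy cache", 0.75, 2),
    ("application server", 0.0, 5),
]
```

With these sample numbers, each proxy sees 1000 req/s while each application server sees 100 req/s, which is the "less traffic at each step" effect from the layering slide.
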
  11. Get a management/monitoring box
      (maybe two or three and have them specialized or redundant)
      [Diagram components: Load Balancer, Reverse Proxy Cache, Application Server, Database, Management]
  12. Planning + Scoping
  13. Infrastructure goals
      ‣ Redundancy
      ‣ Scalability
      ‣ Performance
      ‣ Manageability
  14. Redundancy
      ‣ When one server fails, the website should be able to recover without taking too long.
      ‣ This requires N+1, putting a floor on system requirements.
      ‣ How long can your site be down?
      ‣ Automatic versus manual failover
  15. Performance
      ‣ Find the “sweet spot” for hardware. This is the best price/performance point.
      ‣ Avoid overspending on any type of component
      ‣ Yet, avoid creating bottlenecks
      ‣ Swapping memory to disk is very dangerous
  16. Relative importance
                            Processors/Cores   Memory   Disk Speed
      Reverse Proxy Cache   ●                  ●●●      ●●
      Web Server            ●●●●●              ●●       ●
      Database Server       ●●                 ●●●●     ●●●●
      Monitoring            ●                  ●        ●
  17. Reverse proxy caches
      ‣ Squid makes poor use of multiple cores. Focus on getting the highest per-core performance. The best per-core performance is often on dual-core processors with high clock rates and lots of cache.
      ‣ Varnish is much more multithreaded.
      ‣ 4-8 GB memory, total
      ‣ Expect 1000 requests per second, per Squid
      ‣ 64-bit operating system if more than 2 GB RAM
  18. Web servers
      ‣ Apache 2.2 + mod_php + memcached
      ‣ Many processors + many cores is best
      ‣ 25 Apache threads per core
      ‣ 50 MB memory per thread, system-wide
      ‣ 1 GB memory for system
      ‣ 1 GB memory for memcached
      ‣ Configure MaxClients in Apache to maximum system-wide thread count
      ‣ Expect 1 request per thread, per second
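
The web server rules of thumb above combine into a simple sizing calculation. A minimal sketch: the function and constant names are made up for illustration, and treating 1 GB as 1024 MB is an assumption.

```python
THREADS_PER_CORE = 25   # from the slide
MB_PER_THREAD = 50      # from the slide
SYSTEM_MB = 1024        # "1 GB memory for system" (1 GB taken as 1024 MB)
MEMCACHED_MB = 1024     # "1 GB memory for memcached"
RPS_PER_THREAD = 1      # "1 request per thread, per second"


def size_web_server(cores):
    """Return MaxClients, memory, and expected throughput for one web server."""
    threads = cores * THREADS_PER_CORE
    memory_mb = threads * MB_PER_THREAD + SYSTEM_MB + MEMCACHED_MB
    return {
        "max_clients": threads,          # value for Apache's MaxClients
        "memory_mb": memory_mb,          # RAM needed to avoid swapping
        "expected_rps": threads * RPS_PER_THREAD,
    }
```

For an eight-core machine this gives 200 threads and roughly 12 GB of memory, so the memory budget rather than the core count may decide which hardware hits the "sweet spot".
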
  19. Database servers
      ‣ MySQL 5.0 cannot use more than eight cores effectively but gets good gains from at least quad-core processors.
      ‣ Depend on each Apache thread needing one connection, and add another 50.
      ‣ Each MySQL connection needs around 6 MB.
      ‣ MySQL with InnoDB needs a buffer pool large enough to cache all indexes. Start by giving the pool most remaining database server memory and working from there.
      ‣ 64-bit operating system if more than 2 GB RAM
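
The connection and memory rules above can be sketched the same way. The function name and the `system_mb` reserve are assumptions; the slide only says to start by giving the buffer pool most of the remaining memory, so treat the result as a starting point to tune from.

```python
EXTRA_CONNECTIONS = 50   # from the slide: "add another 50"
MB_PER_CONNECTION = 6    # from the slide: "around 6 MB"


def size_database_server(total_apache_threads, total_memory_mb, system_mb=1024):
    """Estimate max connections and a starting InnoDB buffer pool size.

    total_apache_threads: MaxClients summed over all web servers.
    system_mb: memory held back for the operating system (assumed value).
    """
    connections = total_apache_threads + EXTRA_CONNECTIONS
    connection_mb = connections * MB_PER_CONNECTION
    buffer_pool_mb = total_memory_mb - system_mb - connection_mb
    if buffer_pool_mb <= 0:
        raise ValueError("not enough memory for this many connections")
    return {
        "max_connections": connections,
        "connection_mb": connection_mb,
        "buffer_pool_mb": buffer_pool_mb,
    }
```

Two web servers at 200 threads each and a 16 GB database server leave about 12 GB for the buffer pool under these assumptions.
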
  20. Monitoring server
      ‣ Very low hardware requirements
      ‣ Choose hardware that is inexpensive but essentially similar to the rest of the cluster to reduce management overhead
      ‣ Reliability and fast failover are typically low priorities for monitoring services
  21. Assembling the numbers
      ‣ Start with an architecture providing redundancy.
          ‣ Two servers, each running the whole stack
      ‣ Increase the number of proxy caches based on anonymous and search engine traffic.
      ‣ Increase the number of web servers based on authenticated traffic.
      ‣ Databases are harder to predict, but large sites should run them on at least two separate boxes with replication.
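
One way to assemble the numbers is to divide each layer's peak load by the per-server capacity, add the N+1 spare from the redundancy slide, and never go below the two-server starting point. This is a sketch; the function name and the sample traffic figures are hypothetical.

```python
import math


def servers_needed(peak_rps, rps_per_server, spares=1, minimum=2):
    """Return a server count for one layer, including N+1 redundancy."""
    base = math.ceil(peak_rps / rps_per_server)
    return max(minimum, base + spares)


# Hypothetical peak loads; capacities follow the earlier rules of thumb.
proxy_caches = servers_needed(2500, 1000)  # anonymous + crawler traffic, 1000 req/s per Squid
web_servers = servers_needed(450, 200)     # authenticated traffic, 200 req/s per 8-core server
```

Databases are left out on purpose, since the slide notes that they are harder to predict.
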
  22. Pressflow
      Make Drupal sites scale by upgrading core with a compatible, powerful replacement.
  23. Common large-site issues
      ‣ Drupal core requires patching to effectively support the advanced scalability techniques discussed here.
      ‣ Patches often conflict and have to be reapplied with each Drupal upgrade.
      ‣ The original patches are often unmaintained.
      ‣ Sites stagnate, running old, insecure versions of Drupal core because updating is too difficult.
  24. What is Pressflow?
      ‣ Pressflow is a derivative of Drupal core that integrates the most popular performance and scalability enhancements.
      ‣ Pressflow is completely compatible with existing Drupal 5 and 6 modules, both standard and custom.
      ‣ Pressflow installs as a drop-in replacement for standard Drupal.
      ‣ Pressflow is free as long as the matching version of Drupal is also supported by the community.
  25. What are the enhancements?
      ‣ Reverse proxy support
      ‣ Database replication support
      ‣ Lower database and session management load
      ‣ More efficient queries
      ‣ Testing and optimization by Four Kitchens with standard high-performance software and hardware configuration
      ‣ Industry-leading scalability support by Four Kitchens and Tag1 Consulting
  26. Four Kitchens + Tag1
      ‣ Provide the development, support, scalability, and performance services behind Pressflow
      ‣ Comprise most members of the Drupal.org infrastructure team
      ‣ Have the most experience scaling Drupal sites of all sizes and all types
  27. Ready to scale?
      ‣ Learn more about Pressflow:
          ‣ Pick up pamphlets in the lobby
          ‣ Request Pressflow releases at fourkitchens.com
      ‣ Get the help you need to make it happen:
          ‣ Talk to me (David) or Todd here at DrupalCamp
          ‣ Email shout@fourkitchens.com
  28. Managing the Cluster
  29. The problem
      [Diagram: Software and Configuration distributed to five Application Servers]
      Objectives:
      ‣ Fast, atomic deployment and rollback
      ‣ Minimize single points of failure and contention
      ‣ Restart services
      ‣ Integrate with version control systems
  30. Manual updates and deployment
      [Diagram: five Humans, each updating one of five Application Servers]
      Why not: slow deployment, non-atomic/difficult rollbacks
  31. Shared storage
      [Diagram: five Application Servers sharing one NFS store]
      Why not: single point of contention and failure
  32. rsync
      [Diagram: five Application Servers synchronized with rsync]
      Why not: non-atomic, does not manage services
  33. Capistrano
      [Diagram: five Application Servers deployed with Capistrano]
      Capistrano provides near-atomic deployment, service restarts, automated rollback, test automation, and version control integration (tagged releases).
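
Near-atomic deployment and rollback are commonly achieved by keeping each release in its own directory and switching a `current` symlink between them. The sketch below illustrates that general technique in Python; it is not Capistrano's own code, and the directory names, the `current.tmp` link, and the `activate_release` helper are assumptions. It relies on POSIX rename semantics.

```python
import os
import tempfile


def activate_release(deploy_root, release_name):
    """Atomically point <deploy_root>/current at releases/<release_name>."""
    target = os.path.join("releases", release_name)
    if not os.path.isdir(os.path.join(deploy_root, target)):
        raise FileNotFoundError("no such release: " + release_name)
    current = os.path.join(deploy_root, "current")
    staging_link = os.path.join(deploy_root, "current.tmp")
    if os.path.lexists(staging_link):
        os.remove(staging_link)  # clean up after an interrupted run
    os.symlink(target, staging_link)  # relative link, so the tree can be moved
    # rename() replaces the old link in one step on POSIX systems,
    # so the web server never sees a half-switched release.
    os.replace(staging_link, current)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    for name in ("20090529", "20090530"):
        os.makedirs(os.path.join(root, "releases", name))
    activate_release(root, "20090530")  # deploy the new release
    activate_release(root, "20090529")  # roll back to the previous one
    print(os.readlink(os.path.join(root, "current")))
```

Rollback is the same operation as deployment, pointed at an older directory, which is why it can be fast. Restarting services and fetching tagged releases from version control would sit around this step.
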
  34. Multistage deployment
      Deployments can be staged.
      cap staging deploy
      cap production deploy
      [Diagram: Development, Integration, Staging, and five Application Servers, with the stages deployed with Capistrano]
  35. But your application isn’t the only thing to manage.
  36. Beneath the application
      [Diagram: Cluster-level configuration applied to the Reverse Proxy Cache, the Database, and five Application Servers]
      Cluster management applies to package management, updates, and software configuration. cfengine and bcfg2 are popular cluster-level system configuration tools.
  37. System configuration management
      ‣ Deploys and updates packages, cluster-wide or selectively.
      ‣ Manages arbitrary text configuration files
      ‣ Analyzes inconsistent configurations (and converges them)
      ‣ Manages device classes (app. servers, database servers, etc.)
      ‣ Allows confident configuration testing on a staging server.
  38. All on the management box
      Management:
      ‣ Development
      ‣ Integration
      ‣ Staging
      ‣ Deployment Tools
      ‣ Monitoring
  39. Monitoring
  40. Types of monitoring
      Failure               Capacity/Load
      Analyzing Downtime    Analyzing Trends
      Viewing Failover      Predicting Load
      Troubleshooting       Checking Results of Configuration and Software Changes
      Notification
  41. Everyone needs both.
  42. What to use
      Failure/Uptime   Capacity/Load
      Nagios           Cacti
      Hyperic          Munin
  43. Nagios
      ‣ Highly recommended.
      ‣ Used by Four Kitchens and Tag1 Consulting for client work, Drupal.org, Wikipedia, etc.
      ‣ Easy to install on CentOS 5 using EPEL packages.
      ‣ Easy to install nrpe agents to monitor diverse services.
      ‣ Can notify administrators on failure.
      ‣ We use this on Drupal.org.
  44. Hyperic
      ‣ I haven’t used this much, but it’s fairly popular.
      ‣ More difficult to set up than Nagios.
  45. Cacti
      ‣ Highly annoying to set up.
      ‣ One instance generally collects all statistics. (No “agents” on the systems being monitored.)
      ‣ Provides flexible graphs that can be customized on demand.
      ‣ Optimized database for perpetual statistics collection.
      ‣ We use this on Drupal.org and for client sites.
  46. Munin
      ‣ Fairly easy to set up.
      ‣ One instance generally collects all statistics. (No “agents” on the systems being monitored.)
      ‣ Provides static graphs that cannot be customized.
  47. Cluster Problems
  48. Cache/session coherency
      ‣ Systems that run properly on single boxes may lose coherency when run on a networked cluster.
      ‣ Some caches, like APC’s object cache, have no ability to handle network-level coherency. (APC’s opcode cache is safe to use on clusters.)
      ‣ memcached, if misconfigured, can hash values inconsistently across the cluster, resulting in different servers using different memcached instances for the same keys.
      ‣ Session coherency can be helped with load balancer affinity.
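
The memcached misconfiguration described above is easy to reproduce in miniature. The sketch below uses a deliberately simple hash-modulo scheme, which is an assumption for illustration; real client libraries use their own hashing, and the host names are hypothetical. The point is only that two web servers listing the same cache hosts in a different order can send the same key to different instances.

```python
import zlib


def pick_server(key, servers):
    """Map a key to a server by hashing modulo the list length."""
    index = zlib.crc32(key.encode("utf-8")) % len(servers)
    return servers[index]


def configs_agree(server_lists):
    """Return True when every web server has the same list in the same order."""
    return all(lst == server_lists[0] for lst in server_lists)


web1_servers = ["cache-a:11211", "cache-b:11211"]
web2_servers = ["cache-b:11211", "cache-a:11211"]  # same hosts, different order
```

With these two lists, `pick_server("session:42", web1_servers)` and `pick_server("session:42", web2_servers)` name different hosts, so a session written through one web server is invisible to the other. A check like `configs_agree` is the sort of consistency test a configuration management tool can enforce.
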
  49. Cache regeneration races
      ‣ Downside to network cache coherency: synchronized expiration
      ‣ Hard to solve
      [Diagram: timeline running from Old Cached Item through Expiration to New Cached Item; during the gap, all servers are regenerating the item.]
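
The talk calls this hard to solve and does not prescribe a fix. One common mitigation, shown here only as a sketch, is to let a single server take a lock and regenerate while the others fall back to a stale copy. `FakeCache` is an in-memory stand-in that mimics memcached's add-only-if-absent behavior; the class, the `lock:` key prefix, and `get_or_regenerate` are all hypothetical names.

```python
class FakeCache:
    """In-memory stand-in for a shared cache with memcached-like semantics."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def add(self, key, value):
        """Store only if the key is absent; return True on success."""
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def delete(self, key):
        self._data.pop(key, None)


def get_or_regenerate(cache, key, regenerate, stale=None):
    """Return the cached value, letting only the lock holder rebuild it."""
    value = cache.get(key)
    if value is not None:
        return value
    if cache.add("lock:" + key, 1):
        try:
            value = regenerate()
            cache.set(key, value)
            return value
        finally:
            cache.delete("lock:" + key)
    # Another server holds the lock: serve the stale copy instead of
    # piling on with a second regeneration.
    return stale
```

In a real cluster the lock would need an expiry so that a crashed server cannot hold it forever, and the stale copy has to be kept somewhere, which is part of why the problem stays hard.
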
  50. Broken replication
      ‣ MySQL slave servers get out of sync, fall further behind
      ‣ No means of automated recovery
      ‣ Only solvable with good monitoring and recovery procedures
      ‣ Can automate removal from use, but requires cluster management tools
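
Monitoring for this can be as simple as classifying the `Seconds_Behind_Master` value that MySQL reports in `SHOW SLAVE STATUS`, using the exit codes Nagios plugins conventionally return (0 OK, 1 WARNING, 2 CRITICAL). The sketch below covers only the classification step; the thresholds and the function name are assumptions, and fetching the value from MySQL is left out.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def check_replication(seconds_behind_master, warn=30, crit=300):
    """Classify slave lag; None means replication is not running."""
    if seconds_behind_master is None:
        return CRITICAL, "replication is not running"
    if seconds_behind_master >= crit:
        return CRITICAL, "slave is %d s behind" % seconds_behind_master
    if seconds_behind_master >= warn:
        return WARNING, "slave is %d s behind" % seconds_behind_master
    return OK, "slave is %d s behind" % seconds_behind_master
```

A CRITICAL result is also a reasonable trigger for taking the slave out of the read pool, though recovery itself still needs a human procedure, as the slide notes.
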
  51. Server failure
      ‣ Load balancers can remove broken or overloaded application reverse proxy caches.
      ‣ Reverse proxy caches like Varnish can automatically use only functional application servers.
      ‣ Cluster management tools like heartbeat2 can manage service IPs on MySQL servers to automate failover.
      ‣ Conclusion: Each layer intelligently monitors and uses the servers beneath it.
  52. All content in this presentation, except where noted otherwise, is Creative Commons Attribution-ShareAlike 3.0 licensed and copyright 2009 Four Kitchen Studios, LLC.
