Acquia Managed Cloud: Highly Available Architecture for Highly Unpredictable Traffic
Kieran Lal, Technical Director, Acquia
Jess Iandiorio, Sr. Director, Cloud Product Marketing, Acquia
January 19th, 2012
Your Drupal Application Life Stages
Set-up/Launch
• Build: load balancers, fast page cache, app servers, database, file systems, web servers, app configuration, HA architecture
• Deploy: load testing, integrated Git/SVN, drag and drop content management
Production
• Application updates: Drupal app code
• Infrastructure updates: OS, security
• Operations: 24X7 monitoring & alerts, backups
Crisis
• Diagnosis: site failure, infrastructure failure, application errors
• Resolution: debugging, resize, launch new virtual servers, multi-region failover
Capacity Planning Options
[Chart: users hitting your site, July through December]
1. Over plan: over pay
2. Under plan: expect outages
3. Acquia plan: no failure
Unpredictable Traffic Victims
Events Businesses
• Challenges
 - Plagued by prior event stats
 - Failure extends beyond web
• Consequences of failure
 - Sales (tickets)
 - Brand damage
 - Missed donation opportunities
News/M&E Organizations
• Challenges
 - You never know when you’ll be “Huff Po’d”
 - Time-to-market is critical
• Consequences of failure
 - Loss of credibility
 - Readership
 - Contractual failures per advertising agreements
 - Impact to the ad sales cycle
High Growth Sites
• Challenges
 - Lack of experience/skill set
 - No prior benchmarking data
• Consequences of failure
 - Missed opportunities
 - Discouraged users
 - Loss of confidence
The Framework
1. Planned Successfully: Test early, often
Profile
• Companies that are experienced with resizing exercises
• Allocate 3+ weeks for resizing exercises combined with load testing
• Don’t underestimate administrative challenges
2. Planned Unsuccessfully: Best Effort Not Enough
Profile
• Companies that plan to handle it themselves but don’t have the “crisis” speed skill set
• Web teams that have no prior experience manually scaling servers
• Web teams who don’t have a triage plan in place for evaluating application v. infrastructure failures
3. Unplanned: “Crisis mode”
Profile
• Companies with truly volatile businesses
• Mission-critical sites where failure isn’t an option
• Web teams that haven’t invested in HA architecture
• Web teams that have separate application and infrastructure support
• Companies that are unlucky
1. Planned Successfully: Test early, often
Profile
• Advance notice
• Work with our team to develop a plan and load test it
Acquia:
• Plan development
• Provision resources
• Continuous monitoring on the day of the event
1. Planned Successfully: The King Center
The Players
• Customer: The King Center
• Partner: Palantir, Soasta
• Acquia: Sales, Operations, Support
Triage to Resolution: 3 weeks
2. Planned Unsuccessfully: Best Effort Not Enough
Profile
• Advance notice
• Tried to plan for the “worst case scenario”
• Planning fell short of the worst case scenario
Acquia:
• Immediate detection & resolution of infrastructure issues
2. Planned Unsuccessfully: The BRIT Awards
The Players
• Customer: The BRIT Awards
• Acquia: Support, Operations, Cloud Engineering
Triage to Resolution: 20 minutes
2. Planned Unsuccessfully: Lilith Fair (RIP)
3. Unplanned: “Crisis mode”
Profile
• No advance notice
• Resources not available
• Site goes down
• Panic
Acquia:
• Triage the issue: code, attack or capacity?
• Resolve
The Acquia Triage Checklist
Determine nature of the problem (10 to 30 minutes)
• Check monitoring
• Check logs
Mitigate problem (30 minutes to 2+ hrs)
• Code: roll back or remediate
• Attack
 - DoS: block offending IP
 - DDoS: bring in DOSarrest
• Resize
 - Automatic: server HA, web/DB failover
 - Manual: clone site for internal testing (Nagios), increase size of DB, faster load balancers, larger Varnish page caching, file system updates (GlusterFS), increase web servers
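The checklist above amounts to a small decision table: classify the incident as code, attack, or capacity, then pick the matching mitigation. A minimal Python sketch of that mapping follows. The names `Cause`, `Incident`, and `mitigation_steps` are hypothetical and chosen only for illustration; they are not an Acquia tool or API, and the step strings simply restate the slide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Cause(Enum):
    """The three causes the triage step distinguishes (per the checklist)."""
    CODE = "code"
    ATTACK = "attack"
    CAPACITY = "capacity"


@dataclass
class Incident:
    cause: Cause
    # Hypothetical flag: only meaningful for attacks (many source IPs = DDoS).
    distributed: bool = False


def mitigation_steps(incident: Incident) -> List[str]:
    """Return the checklist's mitigation steps for an already-diagnosed incident."""
    if incident.cause is Cause.CODE:
        return ["Roll back or remediate"]
    if incident.cause is Cause.ATTACK:
        if incident.distributed:
            return ["Bring in DOSarrest"]
        return ["Block offending IP"]
    # Capacity problems: the manual resize options listed on the slide.
    return [
        "Clone site for internal testing",
        "Increase size of DB",
        "Faster load balancers",
        "Larger Varnish page caching",
        "File system updates (GlusterFS)",
        "Increase web servers",
    ]


if __name__ == "__main__":
    for step in mitigation_steps(Incident(Cause.ATTACK, distributed=True)):
        print(step)
```

The diagnosis itself (reading monitoring and logs) is deliberately left outside the sketch; the function only encodes what to do once the cause is known.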
Underlying Elastic Technology Stack
• Caching / Load Balancer: page caching, load balancing
• Drupal Application Servers: web servers, Drupal modules, PHP caching
• Data Services: MySQL, file storage, Memcache, email
• Secure Infrastructure: international data centers, monitoring, Amazon AWS, backups
Each layer is composed of multiple redundant servers. If one fails, there is little or no downtime!
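To see why redundancy inside each layer matters, here is a back-of-the-envelope Python sketch. It assumes servers fail independently and that a layer stays up as long as at least one of its servers is up; both are simplifying assumptions. The 99% per-server availability and the layer counts are made-up illustration values, not Acquia figures, and the function names are hypothetical.

```python
from typing import Dict


def layer_availability(server_availability: float, servers: int) -> float:
    """Probability that at least one of `servers` independent servers is up."""
    return 1.0 - (1.0 - server_availability) ** servers


def stack_availability(layers: Dict[str, int], server_availability: float) -> float:
    """Probability that every layer has at least one server up.

    `layers` maps a layer name to its number of redundant servers.
    """
    total = 1.0
    for count in layers.values():
        total *= layer_availability(server_availability, count)
    return total


if __name__ == "__main__":
    # Hypothetical three-layer stack, with and without redundancy.
    single = {"load balancer": 1, "web": 1, "database": 1}
    redundant = {"load balancer": 2, "web": 2, "database": 2}
    print(f"single servers:    {stack_availability(single, 0.99):.4%}")
    print(f"redundant servers: {stack_availability(redundant, 0.99):.4%}")
```

Under these assumptions, doubling each layer lifts the whole stack from roughly 97% to roughly 99.97% availability, which is the intuition behind "if one fails, there is little or no downtime".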
Multi-region replication & failover
For back-ups across borders
• Acquia can deploy instances in any Amazon EC2 region:
 - US East
 - US West
 - Europe
 - Singapore
 - Japan
• Who is this for?
 - Organizations who see significant risk hosting their sites out of one geographic location
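At its simplest, multi-region failover means serving from the preferred region while it is healthy and falling back to the next one otherwise. The Python sketch below shows only that selection step. It assumes health results come from some external monitoring check; the `pick_region` function and the plain-string region labels are hypothetical and do not describe Acquia's actual failover mechanism.

```python
from typing import Dict, List, Optional


def pick_region(preferred_order: List[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first region in `preferred_order` that is reported healthy.

    Regions missing from `healthy` are treated as unhealthy.
    Returns None when no region is available.
    """
    for region in preferred_order:
        if healthy.get(region, False):
            return region
    return None


if __name__ == "__main__":
    order = ["US East", "US West", "Europe"]
    # Hypothetical health report: the primary region is down.
    report = {"US East": False, "US West": True, "Europe": True}
    print(pick_region(order, report))
```

Treating an unreported region as unhealthy is a conservative design choice: traffic is never sent somewhere the monitoring has not positively confirmed.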
Lessons Learned
1. Planned Successfully: Test early, often
2. Planned Unsuccessfully: Best Effort Not Enough
3. Unplanned: “Crisis mode”
How can I be successful?
• You need elastic infrastructure
• You need scaling automation
• You need a team that can do diagnosis
• You need 24X7 support
• Engage Acquia early and often
Conclusion
• Acquia won’t let you fail
• We have the talent & infrastructure in place to ensure you’re successful
• We’ll find the needle in a haystack, and ensure your best day will never be your worst
Predictable outcomes for unpredictable businesses!
For more information about Managed Cloud
• Check out our website: http://www.acquia.com/products-services/acquia-managed-cloud
• Speak to a Sales rep
Questions
• For more information visit: http://www.acquia.com
• Contact us: email@example.com or 888.9.ACQUIA
• Follow us: @acquia
• Comments welcome: Jess.iandiorio@Acquia.com, Kieran.Lal@Acquia.com
http://acquia.com/resources/recorded_webinars