
Operational Best Practices in the Cloud


RightScale Webinar: Don’t pave the cow path. Cloud infrastructure is very different from traditional infrastructure and requires different approaches to really harness cloud value. From dev/test/prod lifecycle management to deployment automation, patch management, monitoring and automation for autoscaling and disaster recovery... we’ll provide insight into how we automate and manage cloud servers at RightScale to avoid having to get hands on. Especially at 3am.

Published in: Technology, Business

  1. Operational Best Practices in the Cloud (October 27, 2011)
  2. Your Panel Today
     Presenting
     • Rafael H. Saavedra, VP Engineering, RightScale
     • Josep Blanquer, Sr. Systems Architect, RightScale
     Q&A
     • David Manriquez, Account Manager, RightScale
     Please use the “Questions” window to ask questions any time!
     Cloud Management Platform
  3. Agenda
     • RightScale architecture
     • The release cycle
     • Monitoring, alerts and escalations
     • When servers fail
     • Our best practices
     Today’s material will discuss how we run RightScale in the cloud. From this, we distill best practices that are relevant for all.
  4. Operational Best Practices in the Cloud: RightScale architecture
  5. The scale of RightScale
     • > 3M servers launched by RightScale
     • RightScale continuously monitors > 100K servers
     • Every day at RightScale:
         • 2,000 array resize actions are executed
         • 35,000 alert escalations are triggered
         • 20,000 escalation emails are sent to users
         • 9.0TB of monitoring data is exchanged with our servers
         • 1.6TB of logging data is sent to our servers
  6. Architecture of a cloud-based SaaS app
     • RightScale is a SaaS application that runs completely in the cloud
         • Databases
         • Core web app and API
         • Services such as monitoring, logging, and MultiCloud Marketplace
  7. A quick primer on ServerTemplates
     • Configuring servers through bundling images:
         • Custom MySQL 5.0.24 (CentOS 5.2)
         • Custom MySQL 5.0.24 (CentOS 5.4)
         • MySQL 5.0.36 (CentOS 5.4)
         • MySQL 5.0.36 (Ubuntu 8.10)
         • MySQL 5.0.36 (Ubuntu 8.10) 64bit
         • Frontend Apache 1.3 (Ubuntu 8.10)
         • Frontend Apache 2.0 (Ubuntu 9.10) - patched
         • CMS v1.0 (CentOS 5.4)
         • CMS v1.1 (CentOS 5.4)
         • My ASP appserver (windows 2008)
         • My (windows 2008) – security update 1
         • My (windows 2008) – security update 8
         • SharePoint v4 (windows 2003) – 32bit
         • SharePoint v4 (windows 2003) – 64bit
         • SharePoint v4.5 (windows 2003) – 64bit
     • Configuring servers with ServerTemplates:
         • Boot sequence: a set of configuration directives that will install and configure software on top of the base image (listed top to bottom as stacked on the slide)
             • Setup DNS and IPs
             • Restore last backup
             • Configure MySQL
             • Install MySQL Server
             • Install monitoring
         • Base Image (MultiCloudImage): very few and basic
             • CentOS 5.2, Ubuntu 8.10, Win 2003, CentOS 5.4, Ubuntu 9.10, Win 2007 …
  8. We use the same ServerTemplates our customers do
     • RightScale uses 15-20 different ServerTemplates in Production
         • We don’t build images, we use pre-built MultiCloud Images with RightLink
         • We make heavy use of RightScale-provided tool boxes (EBS, DNS, LB)
     • Off-the-shelf: 1 template (MySQL)
     • Customized: App servers and load balancers
         • Written with RightScripts in Ruby, Bash, etc.
         • Mostly Rails apps to run our core services: front-end, API, Marketplace, etc.
     • From MultiCloud Image: Messaging and databases
         • RabbitMQ, Cassandra
  9. Deployments group RightScale services
  10. Best practices: Architecture
      • ServerTemplates can be used off the shelf or customized
          • Don’t bundle images
          • Make heavy use of MCIs instead of hardcoding base RightImages
      • Deployments let you stage servers in the cloud
          • The use of inputs guarantees consistency across all servers
          • Easily test or fail over
          • Macros/API automation can quickly stand up entire deployments
  11. Operational Best Practices in the Cloud: The release cycle
  12. Challenges of the release cycle
      • Limited resources and lead time for procuring and provisioning equipment
      • Maintaining multiple environments from development through production
      • Maintaining consistency for reusability and QA
      • Distributed teams and team members
  13. A typical release cycle flow
  14. Our development environment
      • We keep a number of different deployments
          • Each development team has its own mini-environment
          • A larger integrated staging environment
          • One production environment
      • Accounts keep things organized and secure
          • We keep separate accounts for staging and production
          • One team of sys admins manages all environments
  15. RightScale release cycle
      • One set of scripts and ServerTemplates is used everywhere
          • Gate accounts for security, development vs. production, etc.
          • Less test variance between Production and Staging
          • Only difference is size of environment
      • Easy to bring up a development environment on demand using deployments and macros
          • Get it up and running, on demand, in less than an hour
          • Cloud is pay-by-the-hour, so it is cheap to run temporary environments
  16. Best practices: Release cycle
      • Don’t be afraid to run many environments
          • Dynamically clone, launch and tear down environments for quick tests
          • Configure a fixed set of environments for development, integration, staging
          • Use different accounts to segregate users and configurations
          • Sys admins are expensive. Cloud servers are cheap.
      • Reuse ServerTemplates to keep environments consistent
          • Make use of versioning and freeze software repositories
          • Share or publish them through the MultiCloud Marketplace
          • Create all-in-one ServerTemplates from the same RightScripts and recipes
      • Avoid upgrading existing servers, fail forward instead
          • Keep old servers running so you can roll back, or do a post-mortem later on
          • For databases: Launch additional slaves. Freeze replication at upgrade point. Take snapshots!
  17. Release night steps
      Diagram components: Front Ends, Main App (current and new code), Databases (DB Master, two DB Slaves)
      1) Servers with current code
      2) Servers with new code
      3) Add second slave
      4) Cut access to site
      5) Stop all access to databases
      6) Stop replication
      7) Take snapshot at cutoff
      8) Update schema
      9) Reconnect all servers
      10) Open access to site
  18. Operational Best Practices in the Cloud: Monitoring, alerts and escalations
  19. Monitoring and alerts: Diagnose & optimize
      • Off-the-shelf monitoring
          • OS: CPU, Disk, Memory, Network, Processes, System
          • App: Apache, IIS, MySQL, Nginx, SQL Server
          • Plus many more collectd plug-ins!
      • Custom monitoring
      • Cluster monitoring
      • Alerts & escalations
  20. Monitoring, alerts & escalations
      • We monitor as much relevant data as possible and display it in insightful ways to quickly detect patterns and abnormalities
      • We proactively eliminate the conditions that raise critical alerts
          • No broken windows policy. No critical alerts can remain unresolved.
      [Graphs: API Network Activity; Dashboard Network Activity]
  21. Off-the-shelf: MySQL collectd plugin
  22. Off-the-shelf: MySQL reads graphs
      • Read-random-next represents a table scan
      • Read-next represents an index scan
  23. Custom: Whatever you want with collectd
      • Any statistic you can think of can easily be added as a monitor.
      • All of these are graph-able and alert-able in our dashboard!
      • Many can be written in less than an hour.
          • As easy as printing a line of formatted numbers every few seconds
      • […] is an authority on collectd
      • How we do it:
          • We use Ruby to write our custom monitors
          • Cassandra: jcollectd with JMX to pull out monitoring data from JavaBeans
          • Passenger: Ruby script that parses data from the Passenger command line interface
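
The "printing a line of formatted numbers every few seconds" idea from slide 23 can be sketched with collectd's Exec plugin, which reads `PUTVAL` lines from a script's standard output. The deck says RightScale writes its monitors in Ruby; this is a Python sketch under stated assumptions, not their code. The plugin name `myapp`, the metric `queue_depth`, and the helper `read_queue_depth` are hypothetical placeholders.

```python
import os
import sys
import time


def putval_line(host, plugin, type_instance, value, interval):
    """Format one gauge reading in the Exec plugin's PUTVAL text protocol."""
    identifier = f"{host}/{plugin}/gauge-{type_instance}"
    # "N" asks collectd to timestamp the value with the current time.
    return f'PUTVAL "{identifier}" interval={interval} N:{value}'


def read_queue_depth():
    """Hypothetical metric source; replace with a real measurement."""
    return 0


def run(out=sys.stdout, iterations=None, sleep=time.sleep):
    """Emit one reading per interval; loops forever when iterations is None."""
    # collectd sets these two variables for scripts started by the Exec plugin.
    host = os.environ.get("COLLECTD_HOSTNAME", "localhost")
    interval = int(float(os.environ.get("COLLECTD_INTERVAL", "10")))
    sent = 0
    while iterations is None or sent < iterations:
        line = putval_line(host, "myapp", "queue_depth", read_queue_depth(), interval)
        out.write(line + "\n")
        out.flush()  # collectd reads the pipe line by line, so avoid buffering
        sent += 1
        sleep(interval)
```

Calling `run()` at the bottom of the script makes it suitable as an Exec plugin target; the `iterations` and `sleep` parameters exist only so the loop can be exercised without waiting.
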
  24. Custom: Cassandra monitors
  25. Cluster: Monitor hundreds of servers
      • We leverage a monitoring data warehouse to develop heat maps & stacked graphs
  26. Automated actions using alerts from monitors
      • Create an alert for any monitor, even your custom ones
          • RightScale example: Cassandra pending reads signals overloading
      • Break alerts into critical and warning
          • Critical: Wake me up! Page me!
          • Warning: Send email to team.
      • Trigger many actions: email, run script, scale, relaunch, reboot, …
          • Customize to your monitor, situation, and IT processes
          • RightScale example: Run a RightScript if swap is too high
          • Integrate with 3rd party services like PagerDuty
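
The warning/critical split on slide 26 can be sketched as a threshold check that maps each severity to a list of actions. This is an illustrative Python sketch, not RightScale's alert engine: `AlertSpec`, the action names, and the 50%/80% swap thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertSpec:
    """Thresholds for one monitored metric (values are assumptions)."""
    metric: str
    warning: float
    critical: float


# Slide 26: critical alerts page someone, warnings email the team.
ACTIONS = {
    "critical": ["page_on_call"],
    "warning": ["email_team"],
    "ok": [],
}


def classify(spec, value):
    """Return 'critical', 'warning' or 'ok' for the current reading."""
    if value >= spec.critical:
        return "critical"
    if value >= spec.warning:
        return "warning"
    return "ok"


def escalate(spec, value):
    """Return the list of action names to trigger for this reading."""
    return ACTIONS[classify(spec, value)]


swap = AlertSpec(metric="swap_used_percent", warning=50.0, critical=80.0)
print(escalate(swap, 85.0))  # the critical action list
```

Keeping the severity-to-action table separate from the thresholds lets a team change its response plan without touching the monitors.
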
  27. Best practices: Monitoring and alerts
      • Monitor your critical processes off-the-shelf
          • Set monitors with scripts on your ServerTemplates
          • Use mon_process (e.g. Ruby)
      • Customize to your application needs
          • Use collectd plug-ins or easily build your own
          • The monitor is graphed in the RightScale dashboard
      • Plan out your critical alerts
          • Set your response plan: warnings vs. critical
  28. Operational Best Practices in the Cloud: When servers fail
  29. How to think about server failure in the cloud
      • Design for failure
          • Make sure your application remains healthy after the failure of a node
          • Don’t use sticky sessions
          • Distribute your application services
      • Debug ServerTemplates and not servers
      • Use alerts to reboot and/or relaunch
      • Auto-scale app server arrays
      • Use dynamic DNS and static IPs for load balancers
          • Your app servers and databases will always know where to look
  30. Deep dive on database failure
      • Use database backups for rollbacks or disaster scenarios
          • Restore from backups in the event of complete system failure
          • One-click with fully automated RightScale Database Managers
      • Use database redundancy for high availability (for example, master/slave)
          • Promote slave if master fails
          • Possible to prime your slave database to make failover more seamless
          • After promotion is complete, quick to launch a new slave
          • Worry about troubleshooting when you have time
          • One-click with fully automated RightScale Database Managers
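
The master/slave failover flow on slide 30 (promote a slave, launch a replacement, troubleshoot later) can be sketched as a pure state transition. This is a simplified Python model, not the RightScale Database Manager: the `Cluster` type is hypothetical, and choosing the slave with the least replication lag is an assumption the slide does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical view of a replicated database deployment."""
    master: str
    slaves: dict  # slave name -> replication lag in seconds


def fail_over(cluster, new_slave_name):
    """Promote a slave after the master fails and add a fresh slave.

    Returns the new cluster state and the name of the failed master,
    which is left aside for a later post-mortem.
    """
    if not cluster.slaves:
        raise RuntimeError("no slave available to promote")
    # Assumption: the slave with the least lag has lost the least data.
    promoted = min(cluster.slaves, key=cluster.slaves.get)
    remaining = {n: lag for n, lag in cluster.slaves.items() if n != promoted}
    # The replacement starts infinitely behind until it has caught up.
    remaining[new_slave_name] = float("inf")
    return Cluster(master=promoted, slaves=remaining), cluster.master


before = Cluster(master="db1", slaves={"db2": 3.0, "db3": 0.5})
after, failed = fail_over(before, "db4")
print(after.master, sorted(after.slaves), failed)
```

Returning the failed master's name instead of discarding it mirrors the slide's advice to restore service first and troubleshoot when there is time.
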
  31. Backups to block volumes and object stores
      • Block volumes: EBS snapshots
          + Easy to snapshot
          + Easy to rotate
          + Easy consistency
          + Instant restore (mount)
          - Difficult to move between clouds/regions
          - Must back up entire volume
          • What we do: EBS: Databases
      • Object stores: S3/Cloud Files
          + Back up into other clouds
          + Back up individual folders or files
          + Incremental backups (e.g. as files/data are flushed)
          - More coding, customization
          - Custom rotation strategy
          - Download time
          • What we do: S3: Monitoring system (Cassandra in the future)
  32. Best practices: Planning for failure
      • No excuse for not backing up your servers
          • RightScale Database Manager + EBS tools make it easy to take backups
      • Plan your rotation policy
          • Database Manager helps you tailor daily, weekly, and monthly backups
      • Back up across clouds and regions
          • Database Manager for MySQL and SQL Server makes it easy to back up to S3 or CloudFiles from AWS, CloudStack, Eucalyptus, and Rackspace
      • Organize your backups
          • Keep track with lineages and timelines using the Database Managers
      • Test your backups!
          • It is easy and cheap on the cloud
          • A crisis is the worst time to find out your backups are corrupted
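
A daily/weekly/monthly rotation policy like the one on slide 32 can be sketched as a function that decides which backup dates to retain. This is a Python sketch of one common scheme, not the Database Manager's actual algorithm: the function names and the retention counts are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def _newest_per_bucket(newest_first, bucket_of, limit):
    """Keep the newest backup in each of the `limit` most recent buckets."""
    chosen = {}
    for d in newest_first:
        key = bucket_of(d)
        if key not in chosen:
            if len(chosen) == limit:
                break
            chosen[key] = d
    return set(chosen.values())


def backups_to_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates a rotation run should retain."""
    newest_first = sorted(set(backup_dates), reverse=True)
    keep = set(newest_first[:daily])
    # One backup per ISO (year, week) and one per (year, month).
    keep |= _newest_per_bucket(newest_first, lambda d: tuple(d.isocalendar())[:2], weekly)
    keep |= _newest_per_bucket(newest_first, lambda d: (d.year, d.month), monthly)
    return keep


# Sixty consecutive daily backups ending on the webinar date.
dates = [date(2011, 10, 27) - timedelta(days=i) for i in range(60)]
for kept in sorted(backups_to_keep(dates, daily=3, weekly=2, monthly=2)):
    print(kept.isoformat())
```

Everything outside the returned set is a candidate for deletion, so the same function can drive both reporting and the actual cleanup.
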
  33. Operational Best Practices in the Cloud: Our best practices
  34. Best practices for operating in the cloud
      • Keep your environment organized and consistent
          • Accounts, deployments, ServerTemplates, and macros
      • Change and debug configurations, not servers
          • ServerTemplates, MultiCloudImages, fail-forward
      • Monitor your servers efficiently
          • Off-the-shelf and custom monitoring and alerts
      • Automate, automate and also automate
          • Server arrays, macros/API for more complex flows, alert actions …
      • Back up your databases (organize, multi-cloud, rotate, test)
          • Database Manager ServerTemplates
  35. Getting Started and Q&A
      • Contact RightScale: (866) 720-0208
      • RightScale Conference: Nov 9 in Santa Clara
          • Attend technical breakout
          • Talk with RightScale customers
          • Ask questions at the Expert Bar
          • Training on 11/8 and 11/10
      • More Info: Webinar archive; Papers; Edition