Operational Best Practices in the Cloud

RightScale Webinar: Don’t pave the cow path. Cloud infrastructure is very different from traditional infrastructure and requires different approaches to really harness cloud value. From dev/test/prod lifecycle management to deployment automation, patch management, monitoring and automation for autoscaling and disaster recovery... we’ll provide insight into how we automate and manage cloud servers at RightScale to avoid having to get hands on. Especially at 3am.

Published in: Technology, Business

  • RightScale's ServerTemplates allow you to capture best practices for provisioning and automating cloud infrastructure. In this breakout session, we will explore how you can leverage the RightScale platform to share ServerTemplates with others. Specifically, we'll walk through the steps to share and update ServerTemplates across your organization. We'll also show you how to publish ServerTemplates publicly for the whole world to use. This topic is best for: IT members who are responsible for maintaining server configurations within the organization, developers who would like to share work product within their group, or ISVs wishing to reach cloud users by publishing through RightScale.
  • The cluster monitoring is very powerful in that it provides different types of views into the operation of large clusters of servers
  • More specifically, we hear the following challenges. (Again, use this to unearth where they are having challenges.)
    Limited resources: In almost every phase, limited hardware poses problems. In architecting new systems there are rarely enough resources to experiment with alternative architectures or new technologies. For developers, limited resources usually mean sharing hardware for testing. Testers rarely have enough hardware or time to do all the testing they would like to do: full performance and load testing, testing on complete production architectures, or testing disaster recovery scenarios. Delays in development also often put pressure on testers to do their work faster to still reach the same deadline. The inability to spin up additional testing resources at these times causes quality to suffer. The result is that errors are found later in the cycle, where they are more expensive to fix. Limited equipment also means staff are constantly provisioning, tearing down, and re-provisioning the same equipment. It takes time, and if environments are not completely wiped clean, additional errors are potentially introduced.
    Time to procure and provision equipment: As the load on IT departments increases and release cycles shorten, the wait for equipment to be procured and provisioned takes time away from valuable work. One customer stated it took 3-5 weeks to procure and provision new hardware.
    Maintaining consistent environments: As code moves through development, test, staging and production, changes to configurations in one stage rarely make it back into earlier stages. As new code is implemented from environments that haven’t been updated, the same errors are re-introduced.
    Maintaining multiple environments: As if maintaining one consistent environment across many servers isn’t hard enough, most software requires testing on several different types of configurations (different versions of stacks, for different end user environments), one for each possible production scenario. For example, a software company may need to test its software on different operating systems or alongside various software packages. Most companies need to clone production environments to debug problems without impacting current users. Whether it happens in development or QA, maintaining and reproducing environments is a time-consuming task. If the task is distributed across multiple administrators, coordinating the changes becomes challenging. If the task is consolidated under one administrator, there is a limit to the number of different environments he or she can reliably maintain.
    Distributed teams or team members: These add collaboration requirements and exacerbate all of the issues mentioned.
  • With RightScale it’s easy to create consistent, reproducible configurations in each stage. In a typical development lifecycle, the systems architect creates a reference architecture that serves as a model for production, and then that architecture specifies what components are needed in each configuration.
  • Operational Best Practices in the Cloud

    1. Operational Best Practices in the Cloud
       October 27, 2011
       Watch the video of this webinar
    2. Your Panel Today
       Presenting
       • Rafael H. Saavedra, VP Engineering, RightScale
       • Josep Blanquer, Sr. Systems Architect, RightScale
       Q&A
       • David Manriquez, Account Manager, RightScale
       Please use the “Questions” window to ask questions any time!
       Cloud Management Platform
    3. Agenda
       • RightScale architecture
       • The release cycle
       • Monitoring, alerts and escalations
       • When servers fail
       • Our best practices
       Today’s material will discuss how we run RightScale in the cloud. From this, we distill best practices that are relevant for all.
       Please use the “Questions” window to ask questions any time!
    4. Operational Best Practices in the Cloud: RightScale architecture
    5. The scale of RightScale
       • > 3M servers launched by RightScale
       • RightScale continuously monitors > 100K servers
       • Every day at RightScale:
         • 2,000 array resize actions are executed
         • 35,000 alert escalations are triggered
         • 20,000 escalation emails are sent to users
         • 9.0 TB of monitoring data is exchanged with our servers
         • 1.6 TB of logging data is sent to our servers
    6. Architecture of a cloud-based SaaS app
       • RightScale is a SaaS application that runs completely in the cloud
         • Databases
         • Core web app and API
         • Services such as monitoring, logging, and MultiCloud Marketplace
    7. A quick primer on ServerTemplates
       Configuring servers through bundling images (one image per variant):
       • Custom MySQL 5.0.24 (CentOS 5.2)
       • Custom MySQL 5.0.24 (CentOS 5.4)
       • MySQL 5.0.36 (CentOS 5.4)
       • MySQL 5.0.36 (Ubuntu 8.10)
       • MySQL 5.0.36 (Ubuntu 8.10) 64bit
       • Frontend Apache 1.3 (Ubuntu 8.10)
       • Frontend Apache 2.0 (Ubuntu 9.10) - patched
       • CMS v1.0 (CentOS 5.4)
       • CMS v1.1 (CentOS 5.4)
       • My ASP appserver (windows 2008)
       • My ASP.net (windows 2008) – security update 1
       • My ASP.net (windows 2008) – security update 8
       • SharePoint v4 (windows 2003) – 32bit
       • SharePoint v4 (windows 2003) – 64bit
       • SharePoint v4.5 (windows 2003) – 64bit
       • …
       Configuring servers with ServerTemplates:
       • Boot sequence (as stacked on the slide, top to bottom): Setup DNS and IPs, Restore last backup, Configure MySQL, Install MySQL Server, Install monitoring
       • A set of configuration directives that will install and configure software on top of the base image
       • Base Image (MultiCloudImage), very few and basic: CentOS 5.2, CentOS 5.4, Ubuntu 8.10, Ubuntu 9.10, Win 2003, Win 2007
    8. We use the same ServerTemplates our customers do
       • RightScale uses 15-20 different ServerTemplates in Production
         • We don’t build images, we use pre-built MultiCloud Images with RightLink
         • We make heavy use of RightScale-provided tool boxes (EBS, DNS, LB)
       • Off-the-shelf: 1 template (MySQL)
       • Customized: App servers and load balancers
         • Written with RightScripts in Ruby, Bash, etc.
         • Mostly Rails apps to run our core services: front-end, API, Marketplace, etc.
       • From MultiCloud Image: Messaging and databases
         • RabbitMQ, Cassandra
    9. Deployments group RightScale services
    10. Best practices: Architecture
        • ServerTemplates can be used off the shelf or customized
          • Don’t bundle images
          • Make heavy use of MCIs instead of hardcoding base RightImages
        • Deployments let you stage servers in the cloud
          • The use of inputs guarantees consistency across all servers
          • Easily test or fail over
          • Macros/API automation can quickly stand up entire deployments
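One way to picture how inputs keep servers in a deployment consistent is as layered key/value settings resolved once per server. The sketch below is illustrative only: the function name, the input names, and the precedence order (server override, then deployment, then template default) are assumptions for the example, not RightScale's documented behavior.

```python
from collections import ChainMap


def resolve_inputs(template_defaults, deployment_inputs, server_overrides=None):
    """Merge input layers; earlier layers in the ChainMap win (assumed precedence)."""
    layers = ChainMap(server_overrides or {}, deployment_inputs, template_defaults)
    return dict(layers)


# Hypothetical input names and values, for illustration only.
template_defaults = {"DB_HOST": "localhost", "LOG_LEVEL": "info"}
staging_inputs = {"DB_HOST": "db.staging.example.com"}

app1 = resolve_inputs(template_defaults, staging_inputs)
app2 = resolve_inputs(template_defaults, staging_inputs)
```

Because both servers resolve the same deployment-level inputs, they end up with identical settings unless one of them is given an explicit override.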
    11. Operational Best Practices in the Cloud: The release cycle
    12. Challenges of the release cycle
        • Limited resources and lead time for procuring and provisioning equipment
        • Maintaining multiple environments from development through production
        • Maintaining consistency for reusability and QA
        • Distributed teams and team members
    13. A typical release cycle flow
    14. Our development environment
        • We keep a number of different deployments
          • Each development team has its own mini-environment
          • A larger integrated staging environment
          • One production environment
        • Accounts keep things organized and secure
          • We keep separate accounts for staging and production
          • One team of sys admins manages all environments
    15. RightScale release cycle
        • One set of scripts and ServerTemplates is used everywhere
          • Gate accounts for security, development vs. production, etc.
          • Less test variance between Production and Staging
          • Only difference is size of environment
        • Easy to bring up development environment on demand using deployments and macros
          • Get it up and running, on demand in less than an hour
          • Cloud is pay-by-the-hour, so it is cheap to run temporary environments
    16. Best practices: Release cycle
        • Don’t be afraid to run many environments
          • Dynamically clone, launch and tear down environments for quick tests
          • Configure a fixed set of environments for development, integration, staging
          • Use different accounts to segregate users and configurations
          • Sys admins are expensive. Cloud servers are cheap.
        • Reuse ServerTemplates to keep environments consistent
          • Make use of versioning and freeze software repositories
          • Share or publish them through the MultiCloud Marketplace
          • Create all-in-one ServerTemplates from the same RightScripts and recipes
        • Avoid upgrading existing servers, fail forward instead
          • Keep old servers running so you can roll back, or do a post-mortem later on
          • For databases: Launch additional slaves. Freeze replication at upgrade point. Take snapshots!
    17. Release night steps
        1) Servers with current code
        2) Servers with new code
        3) Add second slave
        4) Cut access to site
        5) Stop all access to databases
        6) Stop replication
        7) Take snapshot at cutoff
        8) Update schema
        9) Reconnect all servers
        10) Open access to site
        [Diagram components: Front Ends; Main App (current code); Main App (new code); Databases: DB Master, DB Slave, DB Slave]
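The release night steps read naturally as an ordered runbook that halts at the first failure, so the old servers (step 1, the starting state) are still available for rollback. The following is a minimal sketch under that assumption; `run_release` and the `execute` callback are hypothetical names, and the step strings simply restate steps 2 to 10 from the slide.

```python
# Steps 2-10 from the slide; step 1 (servers with current code) is the starting state.
RELEASE_STEPS = [
    "launch servers with new code",
    "add second DB slave",
    "cut access to site",
    "stop all access to databases",
    "stop replication",
    "take snapshot at cutoff",
    "update schema",
    "reconnect all servers",
    "open access to site",
]


def run_release(steps, execute):
    """Run steps in order; stop at the first one whose execute(step) returns False.

    Returns (completed_steps, failed_step), where failed_step is None on success.
    """
    completed = []
    for step in steps:
        if not execute(step):
            return completed, step
        completed.append(step)
    return completed, None


# Example: simulate a failure while updating the schema.
completed, failed = run_release(RELEASE_STEPS, lambda step: step != "update schema")
```

Stopping before the schema update completes leaves the snapshot from the cutoff point intact, which is what makes falling back to the servers with current code practical.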
    18. Operational Best Practices in the Cloud: Monitoring, alerts and escalations
    19. Monitoring and alerts: Diagnose & optimize
        • Off-the-shelf monitoring
          • OS: CPU, Disk, Memory, Network, Processes, System
          • App: Apache, IIS, MySQL, Nginx, SQL Server
          • Plus many more collectd plug-ins!
        • Custom monitoring
        • Cluster monitoring
        • Alerts & escalations
    20. Monitoring, alerts & escalations
        • We monitor as much relevant data as possible and display it in insightful ways to quickly detect patterns and abnormalities
        • We proactively eliminate the conditions that raise critical alerts
          • No broken windows policy. No critical alerts can remain unresolved.
        [Graphs: API Network Activity; Dashboard Network Activity]
    21. Off-the-shelf: MySQL Collectd Plugin
    22. Off-the-shelf: MySQL reads graphs
        • Read-random-next represents a table scan
        • Read-next represents an index scan
    23. Custom: Whatever you want with collectd
        • Any statistic you can think of can easily be added as a monitor.
        • All of these are graph-able and alert-able in our dashboard!
        • Many can be written in less than an hour.
          • As easy as printing a line of formatted numbers every few seconds
        • support.rightscale.com is an authority on collectd
        • How we do it:
          • We use Ruby to write our custom monitors
          • Cassandra: jcollectd with JMX to pull out monitoring data from JavaBeans
          • Passenger: Ruby script that parses data from Passenger command line interface
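"Printing a line of formatted numbers every few seconds" matches how collectd's Exec plugin accepts values: the script writes `PUTVAL` lines to standard output. The slides say RightScale writes its monitors in Ruby; the sketch below uses Python purely for illustration. The plugin name `myapp`, the `queue_depth` instance, and `read_queue_depth` are made-up placeholders, while `COLLECTD_HOSTNAME` and `COLLECTD_INTERVAL` are the environment variables the Exec plugin sets for its scripts.

```python
import os
import time


def putval_line(host, plugin, type_name, value, interval, instance=None):
    """Format one PUTVAL line in collectd's plain-text protocol."""
    type_part = f"{type_name}-{instance}" if instance else type_name
    identifier = f"{host}/{plugin}/{type_part}"
    # "N" tells collectd to timestamp the value with the current time.
    return f'PUTVAL "{identifier}" interval={interval} N:{value}'


def read_queue_depth():
    """Hypothetical probe; replace with a real measurement."""
    return 7


def main(iterations=None):
    """Emit one value per interval; loops forever when iterations is None."""
    host = os.environ.get("COLLECTD_HOSTNAME", "localhost")
    interval = int(float(os.environ.get("COLLECTD_INTERVAL", "10")))
    count = 0
    while iterations is None or count < iterations:
        line = putval_line(host, "myapp", "gauge", read_queue_depth(), interval, "queue_depth")
        print(line, flush=True)  # flush so collectd sees the value immediately
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval)
```

A script registered with the Exec plugin would call `main()` and keep running; `main(iterations=1)` is handy for checking the output format by hand.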
    24. Custom: Cassandra monitors
    25. Cluster: Monitor hundreds of servers
        • We leverage a monitoring data warehouse to develop heat maps & stacked graphs
    26. Automated actions using alerts from monitors
        • Create an alert for any monitor, even your custom ones
          • RightScale example: Cassandra pending reads signals overloading
        • Break alerts into critical and warning
          • Critical: Wake me up! Page me!
          • Warning: Send email to team.
        • Trigger many actions: email, run script, scale, relaunch, reboot,…
          • Customize to your monitor, situation, and IT processes
          • RightScale example: Run a RightScript if swap is too high
          • Integrate with 3rd party services like PagerDuty
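The split between critical and warning alerts can be sketched as a threshold rule plus a severity-to-actions table. Everything named here (`AlertSpec`, `ESCALATIONS`, the action strings, the threshold of 100 pending reads) is hypothetical and chosen for illustration; it is not the RightScale alert format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertSpec:
    metric: str
    threshold: float
    duration_samples: int  # consecutive samples that must exceed the threshold
    severity: str          # "warning" or "critical"


# Assumed mapping from severity to actions, mirroring the slide:
# critical pages someone, warning only emails the team.
ESCALATIONS = {
    "warning": ["email_team"],
    "critical": ["page_on_call", "email_team"],
}


def evaluate(spec, samples):
    """Return the actions to trigger, or [] if the condition has not held long enough."""
    recent = samples[-spec.duration_samples:]
    if len(recent) < spec.duration_samples:
        return []
    if all(value > spec.threshold for value in recent):
        return list(ESCALATIONS[spec.severity])
    return []


pending_reads = AlertSpec("cassandra/pending_reads", 100, 3, "critical")
actions = evaluate(pending_reads, [50, 120, 130, 140])
```

Requiring several consecutive samples above the threshold avoids paging on a single spike, which helps keep critical alerts rare enough that none are left unresolved.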
    27. Best practices: Monitoring and alerts
        • Monitor your critical processes off-the-shelf
          • Set monitors with scripts on your ServerTemplates
          • Use mon_process (e.g. Ruby)
        • Customize to your application needs
          • Use collectd plug-ins or easily build your own
          • The monitor is graphed in the RightScale dashboard
        • Plan out your critical alerts
          • Set your response plan: warnings vs. critical
    28. Operational Best Practices in the Cloud: When servers fail
    29. How to think about server failure in the cloud
        • Design for failure
          • Make sure your application remains healthy after the failure of a node
          • Don’t use sticky sessions
          • Distribute your application services
        • Debug ServerTemplates and not servers
        • Use alerts to reboot and/or relaunch
        • Auto-scale app server arrays
        • Use dynamic DNS and static IPs for load balancers
          • Your app servers and databases will always know where to look
    30. Deep dive on database failure
        • Use database backups for rollbacks or disaster scenarios
          • Restore from backups in event of complete system failure
          • One-click with fully automated RightScale Database Managers
        • Use database redundancy for high availability (example master/slave)
          • Promote slave if master fails
          • Possible to prime your slave database to make failover more seamless
          • After promotion is complete, quick to launch a new slave
          • Worry about troubleshooting when you have time
          • One-click with fully automated RightScale Database Managers
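The master/slave failover sequence (promote a slave, launch a replacement, troubleshoot the failed master later) can be modeled in a few lines. This is a toy model with hypothetical names (`Cluster`, `promote_slave`, `launch_slave`), not the Database Manager implementation; it promotes the first slave in the list, which is an arbitrary choice for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    master: str
    slaves: list
    retired: list = field(default_factory=list)  # failed masters kept for post-mortem


def promote_slave(cluster, launch_slave):
    """Promote a slave to master, then launch a fresh slave to restore redundancy."""
    if not cluster.slaves:
        raise RuntimeError("no slave available to promote")
    failed_master = cluster.master
    cluster.master = cluster.slaves.pop(0)
    # Keep the failed server around instead of debugging it during the outage.
    cluster.retired.append(failed_master)
    cluster.slaves.append(launch_slave())
    return cluster


cluster = Cluster(master="db1", slaves=["db2"])
promote_slave(cluster, launch_slave=lambda: "db3")
```

Keeping the failed master in a `retired` list reflects the "worry about troubleshooting when you have time" advice: service is restored first, diagnosis comes afterwards.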
    31. Backups to block volumes and object stores
        • Block volumes: EBS snapshots
          • + Easy to snapshot
          • + Easy to rotate
          • + Easy consistency
          • + Instant restore (mount)
          • - Difficult to move between clouds/regions
          • - Must backup entire volume
          • What we do: EBS for databases
        • Object stores: S3/Cloud Files
          • + Backup into other clouds
          • + Backup individual folders or files
          • + Incremental backups (e.g. as files/data are flushed)
          • - More coding, customization
          • - Custom rotation strategy
          • - Download time
          • What we do: S3 for the monitoring system (Cassandra in the future)
    32. Best practices: Planning for failure
        • No excuse for not backing up your servers
          • RightScale Database Manager + EBS tools make it easy to take backups
        • Plan your rotation policy
          • Database Manager helps you tailor daily, weekly, and monthly backups
        • Backup across clouds and regions
          • Database Manager for MySQL and SQL Server makes it easy to back up to S3 or CloudFiles from AWS, CloudStack, Eucalyptus, and Rackspace
        • Organize your backups
          • Keep track with lineages and timelines using the Database Managers
        • Test your backups!
          • It is easy and cheap on the cloud
          • A crisis is the worst time to find out your backups are corrupted
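A daily/weekly/monthly rotation policy can be expressed as a small predicate over snapshot dates. The sketch below is one possible scheme, not the Database Manager's actual logic: the retention counts (7 dailies, 4 weeklies, 12 monthlies), the choice of Sunday as the weekly snapshot, and the function names are all assumptions.

```python
from datetime import date


def should_keep(snapshot_day, today, daily=7, weekly=4, monthly=12):
    """Decide whether a snapshot survives rotation (all limits are assumptions)."""
    age_days = (today - snapshot_day).days
    if age_days < daily:
        return True  # recent dailies are always kept
    if snapshot_day.weekday() == 6 and age_days < weekly * 7:
        return True  # Sunday snapshots act as weeklies
    if snapshot_day.day == 1 and age_days < monthly * 31:
        return True  # first-of-month snapshots act as monthlies
    return False


def snapshots_to_delete(snapshot_days, today, **limits):
    """Return the snapshot dates that rotation would prune, oldest first."""
    return sorted(d for d in snapshot_days if not should_keep(d, today, **limits))


today = date(2011, 10, 27)
snapshots = [
    date(2011, 10, 25),  # recent daily
    date(2011, 10, 16),  # a Sunday within the weekly window
    date(2011, 10, 12),  # mid-week, older than the daily window
    date(2011, 9, 1),    # first of the month
    date(2010, 9, 1),    # first of the month, but too old
]
pruned = snapshots_to_delete(snapshots, today)
```

Writing the policy as a pure function of dates makes it easy to test before pointing it at real snapshots, in the same spirit as testing the backups themselves.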
    33. Operational Best Practices in the Cloud: Our best practices
    34. Best practices for operating in the cloud
        • Keep your environment organized and consistent
          • Accounts, deployments, ServerTemplates, and macros
        • Change and debug configurations not servers
          • ServerTemplates, MultiCloudImages, fail-forward
        • Monitor your servers efficiently
          • Off-the-shelf and custom monitoring and alerts
        • Automate, automate and also automate
          • Server arrays, macros/API for more complex flows, alert actions …
        • Backup your databases (organize, multi-cloud, rotate, test)
          • Database Manager ServerTemplates
    35. Getting Started and Q&A
        Contact RightScale
        • (866) 720-0208
        • sales@rightscale.com
        • www.rightscale.com
        RightScale Conference
        • Nov 9 in Santa Clara, CA
        • www.RightScale.com/Conference
        • Attend technical breakout sessions
        • Talk with RightScale customers
        • Ask questions at the Expert Bar
        • Training on 11/8 and 11/10
        More Info
        • Webinar archive: RightScale.com/webinars
        • White Papers: RightScale.com/whitepapers
        • Free Edition: RightScale.com/Free
