Optimizing Your Cloud Applications in RightScale
 

RightScale User Conference NYC 2011

Rafael Saavedra - VP Engineering, RightScale

Performance tuning applications in the public cloud is both easier and harder than on your own server hardware. It's much easier to scale up and scale out in the cloud but you generally don't have much (if any) control over the hardware. With public cloud, you take the building blocks offered by the cloud infrastructure and design the application architecture to scale based on the capacity planning requirements and scalability testing results. In this session, we'll talk through our experiences scaling and performance tuning the RightScale platform in the cloud and share tips for sizing, auto-scaling, monitoring, and troubleshooting large-scale cloud deployments.

  • Cluster monitoring is powerful because it provides several types of views into the operation of large clusters of servers.
  • Walk through of how it works: in any deployment, go to the monitoring tab, select the servers, and select the metric to plot. The familiar controls switch the time period and graph size. The view displays one graph per server, here core1.rightscale.com through core8.rightscale.com. In this example the graphs show CPU utilization for the past week, where blue is busy time and green is idle.
  • Individual graphs only work for so many servers, and they don’t show what is happening in aggregate. Stacked graphs stack the contribution of each server on top of one another. Walk through what the graph shows.
  • Stacked graphs are great for seeing the aggregate, but it is often difficult to see abnormal server behavior. Heat maps show many servers on one graph by plotting one horizontal bar per server. The time axis is the same for all servers and is shown at the bottom of the graph. The color of the bar shows the value of the metric for the server. Walk through the graph. It’s easy to see that there are 6 servers sharing the load, and two servers that are different.
  • At scale, this is how it all looks and comes together. This example is real: it shows an incident we had with our monitoring cluster a few months ago. This heat map shows 100 servers out of one of our monitoring clusters (we want to be vague here…). When there are more than 100 servers, the heat map shows a sampling of 100. Describe the sampling: most recently launched, longest running, some of each server template, rest random. Story: This heat map plots I/O wait for our monitoring servers on a day when we suddenly received a number of alerts for a few servers. The heat map shows these servers clearly as red bands starting between 7am and 8am. So we could clearly see that something was going on with a small number of servers and that it started more or less at the same time on all of them. To see what happened in aggregate, we can switch graph type…
  • This shows the same incident as on the previous slide, but with a timescale of a week. It shows the number of servers handled by each monitoring server, i.e. each color bar shows one server. It is easy to see that some customer launched a large number of servers right at the time the overload began. Further investigation showed that, due to a bug, these servers were allocated unevenly across the cluster, causing the overload.
  • The architecture behind the cluster monitoring is rather extensive. Customer (i.e. your) servers send monitoring data every 20 seconds to our servers. The data points are cached in memory on those servers and flushed to disk periodically. Cluster monitoring graphs are produced on separate front-end servers, which pull the data from over 100 monitoring storage servers. The graphs are produced using rrdtool and auto-refresh.
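The sampling rule mentioned in the heat-map notes above (most recently launched, longest running, some of each server template, rest random) could be sketched roughly as follows. This is only an illustration: the function name, the dictionary keys, the per-group quota of 10, and the choice of one server per template are all assumptions, not RightScale's actual implementation.

```python
import random


def sample_servers(servers, limit=100, per_group=10, seed=None):
    """Pick at most `limit` servers to draw on a heat map.

    `servers` is a list of dicts with hypothetical keys:
    'name', 'launched_at' (larger means more recent) and 'template'.
    """
    if len(servers) <= limit:
        return list(servers)

    rng = random.Random(seed)
    chosen = {}  # name -> server; dicts keep insertion order

    def take(candidates, count):
        # Add up to `count` not-yet-chosen servers from `candidates`.
        for server in candidates:
            if count <= 0 or len(chosen) >= limit:
                break
            if server["name"] not in chosen:
                chosen[server["name"]] = server
                count -= 1

    by_age = sorted(servers, key=lambda s: s["launched_at"])
    take(reversed(by_age), per_group)  # most recently launched
    take(by_age, per_group)            # longest running

    # One representative per server template (assumed reading of "some of each").
    seen_templates = set()
    representatives = []
    for server in servers:
        if server["template"] not in seen_templates:
            seen_templates.add(server["template"])
            representatives.append(server)
    take(representatives, per_group)

    # Fill the remaining slots at random.
    rest = [s for s in servers if s["name"] not in chosen]
    rng.shuffle(rest)
    take(rest, limit - len(chosen))
    return list(chosen.values())
```

Passing a fixed `seed` makes the random remainder reproducible, which helps when comparing two renderings of the same cluster.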

Presentation Transcript

  • Optimizing Your Cloud Applications in RightScale
    Rafael H. Saavedra - VP Engineering, RightScale
    June 8th, 2011
  • Agenda
    Introduction
    3-tier application architecture
    Vertical & horizontal scaling
    RightScale monitoring and cluster graphs
    New Relic RPM
    Support for optimizing DB performance
    Miscellaneous
  • Cloud computing characteristics
    Multi-tenancy
    Shared resource pooling
    Geo-distribution and ubiquitous network access
    Service oriented
    Dynamic resource provisioning
    Self-organizing
    Utility based pricing
  • Cloud computing advantages
    No upfront investment
    Lowering operating costs
    Highly scalable
    Easy access
    Reduces business risk and maintenance costs
  • 3-tier application architecture
    Load balancers
    A farm of application servers
    Master-slave
  • Cloud performance optimization
    Instance size (vertical scaling)
    Instance autoscaling (horizontal scaling)
    Server arrays
    RightScale support for performance optimization
    ServerTemplates are configured to capture performance data
    Collectd RightScripts
    Hardware & OS monitoring data
    Specialized plugins – MySQL, HAProxy, Apache, Nginx, IIS, etc.
    Monitoring graphs: individual, cluster, stacked, heat maps
    Alerts & escalations
    New Relic RPM
  • Scaling up – spectrum of instance sizes
    Compute units vs memory vs cost
  • Server arrays provide horizontal scaling
  • Server arrays provide horizontal scaling
    The array scales up or down based on performance votes
    Tags allow scaling on an arbitrary decision set
    Decision threshold controls reaction time
    Sleep time allows new resources to have an impact
    Scaling can be time dependent
    Detailed setup instructions: http://bit.ly/c1oLr2
    Fast response to changes in load conditions using alerts
    Allocation of servers to availability zones based on weights
    Deployment-based so configuration is consistent
    Arrays can be pre-scaled to support anticipated demand
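To make the interplay of votes, decision threshold, and sleep time concrete, here is a minimal sketch of a vote-based scaling decision. The function name, the vote labels, the default threshold of 0.51, and the 300-second sleep time are hypothetical values chosen for illustration; they are not taken from the RightScale product.

```python
def scaling_decision(votes, threshold=0.51, now=0.0,
                     last_action_at=None, sleep_time=300.0):
    """Return 'grow', 'shrink' or 'hold' for a server array.

    votes: one entry per server, each 'grow', 'shrink' or None (no vote).
    threshold: fraction of servers that must agree before the array acts.
    sleep_time: seconds to wait after a scaling action so that newly
        launched (or terminated) servers can have an impact on the load.
    """
    # Still inside the sleep window of the previous action: do nothing.
    if last_action_at is not None and now - last_action_at < sleep_time:
        return "hold"
    if not votes:
        return "hold"

    grow_share = sum(v == "grow" for v in votes) / len(votes)
    shrink_share = sum(v == "shrink" for v in votes) / len(votes)

    if grow_share >= threshold:
        return "grow"
    if shrink_share >= threshold:
        return "shrink"
    return "hold"
```

In this sketch a lower threshold reacts faster but is more likely to act on a few noisy servers, while a longer sleep time trades reaction speed for stability.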
  • Cluster monitoring
    Individual graphs
    Good for a dozen servers
    Displays all standard graphs with full detail
    Stacked graphs
    Displays the contribution of many servers to a total
    Great to see the sum and variability of activity in a cluster
    Difficult to make out individual servers
    Examples: requests/sec, cpu busy cycles, I/O bytes/sec
    Heat maps
    Displays a bar for each server
    Great to see uneven distribution across servers
    Great to quickly spot performance problems across many servers
    Difficult to read absolute values or see the total cluster activity
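The trade-off between the two aggregate views can be illustrated with a small sketch: a stacked graph answers "what is the cluster total?", while a heat map answers "which servers differ from the rest?". The helper names, the sample data, and the outlier rule (mean above twice the cluster median) are assumptions made for this example only.

```python
from statistics import median


def stacked_total(series):
    """Sum the per-server samples at each time step.

    This is the top edge of a stacked graph: the cluster-wide total.
    `series` maps server name -> list of samples on a shared time axis.
    """
    return [sum(column) for column in zip(*series.values())]


def hot_servers(series, factor=2.0):
    """Return servers whose mean exceeds `factor` times the median mean.

    This is the kind of uneven distribution a heat map makes visible.
    """
    means = {name: sum(vals) / len(vals) for name, vals in series.items()}
    typical = median(means.values())
    return sorted(name for name, m in means.items() if m > factor * typical)


# Hypothetical CPU-busy percentages for three servers.
cpu_busy = {
    "core1": [10, 12, 11],
    "core2": [11, 10, 12],
    "core3": [50, 55, 60],
}
total = stacked_total(cpu_busy)
outliers = hot_servers(cpu_busy)
```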
  • Cluster monitoring
    Current cluster monitoring: one graph per server
  • Stacked graphs
    Each color band shows contribution of one server
    Servers are stacked on top of one another
  • Heat maps
    Each horizontal strip shows one server
    The color shows how “hot” the server is running
  • Heat map with 100 servers
  • Stacked graph of the same 100 servers
  • Cluster monitoring architecture
    Monitoring front-end servers pull data from storage servers
    Up to 100 servers on one graph (to be increased)
    [Diagram labels: your servers, monitoring storage servers, monitoring front-end servers]
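The speaker notes for this slide say that the storage servers cache incoming data points in memory and flush them to disk periodically. A minimal sketch of that buffering pattern is shown below; the class name, the flush interval, and the in-memory list standing in for disk are all hypothetical, and the real system writes rrdtool data rather than Python lists.

```python
import time
from collections import defaultdict


class SampleBuffer:
    """Buffer monitoring samples in memory and flush them in batches."""

    def __init__(self, flush_interval=300.0, clock=time.monotonic):
        self.flush_interval = flush_interval  # seconds between flushes (assumed)
        self.clock = clock                    # injectable for testing
        self.pending = defaultdict(list)      # (server, metric) -> [(t, value)]
        self.last_flush = clock()
        self.flushed = []                     # stand-in for the on-disk store

    def add(self, server, metric, value):
        """Record one sample and flush if the interval has elapsed."""
        self.pending[(server, metric)].append((self.clock(), value))
        if self.clock() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        """Write every buffered series out in one batch, then reset."""
        for key, points in self.pending.items():
            self.flushed.append((key, points))
        self.pending.clear()
        self.last_flush = self.clock()
```

Batching writes this way trades a small window of potential data loss for far fewer disk operations, which matters when every monitored server reports every 20 seconds.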
  • New Relic RPM
    Real-Time App Performance Analytics
    Supports Ruby, PHP, Java & .NET
    SQL & NoSQL performance
    Web transaction tracing
    Performance notifications
    Availability monitoring
    Scalability analysis
  • New Relic RPM
    Direct access from RightScale dashboard
  • New Relic RPM
    Historical statistics over a period of time
  • New Relic RPM
    Distribution of the most time consuming requests
  • New Relic RPM
    Statistics about response times from different countries
  • New Relic RPM
    Detailed response times by browser
  • New Relic RPM – 3 Examples
    An expensive query
    The N+1 query problem
    Finding patterns in similar requests
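For readers unfamiliar with the N+1 query problem named on this slide, the sketch below contrasts it with a single joined query. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the `deployments` and `servers` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE deployments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE servers (id INTEGER PRIMARY KEY, deployment_id INTEGER, name TEXT);
    INSERT INTO deployments VALUES (1, 'web'), (2, 'db');
    INSERT INTO servers VALUES (1, 1, 'app1'), (2, 1, 'app2'), (3, 2, 'master');
""")


def servers_n_plus_one(conn):
    """One query for the parents, then one more query per parent row."""
    queries = 1
    result = {}
    parents = conn.execute(
        "SELECT id, name FROM deployments ORDER BY id").fetchall()
    for dep_id, dep_name in parents:
        rows = conn.execute(
            "SELECT name FROM servers WHERE deployment_id = ? ORDER BY id",
            (dep_id,)).fetchall()
        queries += 1
        result[dep_name] = [row[0] for row in rows]
    return result, queries


def servers_single_join(conn):
    """Fetch the same data with a single JOIN."""
    result = {}
    rows = conn.execute("""
        SELECT d.name, s.name
        FROM deployments d JOIN servers s ON s.deployment_id = d.id
        ORDER BY d.id, s.id
    """)
    for dep_name, server_name in rows:
        result.setdefault(dep_name, []).append(server_name)
    return result, 1
```

With N parent rows the first version issues N+1 queries, so its cost grows with the data, while the joined version stays at one round trip.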
  • Optimizing DB performance
    RightScale MySQL ServerTemplates
    Configuration files tailored to instance size
    innodb_buffer_pool_size
    key_buffer_size
    thread_size
    sort_buffer_size
    The never ending task of identifying current bottlenecks
    Disk seeks
    Performance of disk operations
    Scale up when working set cannot fit in memory – avoid active swapping
    Constant monitoring of performance graphs, logs, and queries
    Schema considerations
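As an illustration of what "configuration files tailored to instance size" might mean in practice, the sketch below derives two of the listed settings from the instance's memory. The percentages (60% for the InnoDB buffer pool, 5% for the key buffer) are placeholder assumptions, not the values used by the RightScale ServerTemplates, and both function names are hypothetical.

```python
def mysql_memory_settings(ram_mb, innodb_percent=60, key_buffer_percent=5):
    """Derive memory-related MySQL settings from instance RAM in MB.

    The percentages are illustrative assumptions; suitable values depend
    on the workload and on what else runs on the instance.
    """
    return {
        "innodb_buffer_pool_size": f"{ram_mb * innodb_percent // 100}M",
        "key_buffer_size": f"{ram_mb * key_buffer_percent // 100}M",
    }


def render_my_cnf(settings):
    """Render the settings as a [mysqld] section of a my.cnf file."""
    lines = ["[mysqld]"]
    lines += [f"{name} = {value}" for name, value in settings.items()]
    return "\n".join(lines)


# Example: an instance with 7680 MB of RAM.
print(render_my_cnf(mysql_memory_settings(7680)))
```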
  • Schema considerations
    Lookups need to be indexed
    Sorting requires an index
    Joins need to be done on indices
    Become slower as tables grow
    Compound indices should be used consistently
    Do not abuse indices
    Each index requires a disk write
    Compact tables if they become fragmented
    Deleted rows do not remove the corresponding index entries
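The first point, that lookups need to be indexed, can be demonstrated with a small experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (its `EXPLAIN QUERY PLAN` plays the role of MySQL's `EXPLAIN`); the table, column, and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE elastic_ips (id INTEGER PRIMARY KEY, instance_id INTEGER, ip TEXT)")
conn.executemany(
    "INSERT INTO elastic_ips (instance_id, ip) VALUES (?, ?)",
    [(i, f"10.0.{i // 256}.{i % 256}") for i in range(1000)])


def plan(conn, sql, params=()):
    """Return SQLite's query plan as one string (last column is the detail)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


query = "SELECT * FROM elastic_ips WHERE instance_id = ? LIMIT 1"

# Without an index the lookup has to scan the table.
before = plan(conn, query, (42,))

# After indexing the lookup column, the planner can search the index.
conn.execute("CREATE INDEX idx_elastic_ips_instance ON elastic_ips (instance_id)")
after = plan(conn, query, (42,))

print(before)
print(after)
```

The exact wording of the plan differs between SQLite versions, so the example prints it rather than assuming a specific string.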
  • Monitoring DB performance
    Standard collectd statistics
    User vs wait time (disk operations)
    Performance of disk operations
    Scale up when working set cannot fit in memory
    MySQL collectd plugin
    Monitor INSERT, SELECT, UPDATE operations
    The breakdown of read operations can indicate missing indices
    Monitoring /var/log/mysql-slow.log file
    Identify slow queries
    Use MySQL EXPLAIN command to identify query plan
  • MySQL Collectd Plugin
    Uses MySQL SHOW STATUS command to collect statistics
    A large set of counters that are divided into 10 categories
    Connections
    IO Requests
    Select Rates
    Read Rates
    Key Rates
    Commands Rates
    Query Cache
    Tables
    Memory
    Misc.
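Most SHOW STATUS values are cumulative counters, so a graph of rates has to be derived from the difference between two snapshots. The sketch below shows that calculation; the function name and the sample numbers are hypothetical, and how the collectd plugin handles counter resets internally is not stated in the slides.

```python
def counter_rates(before, after, interval_seconds):
    """Convert two snapshots of cumulative counters into per-second rates.

    `before` and `after` map status variable names to integer values,
    taken `interval_seconds` apart.
    """
    rates = {}
    for name, new in after.items():
        old = before.get(name)
        # Skip counters that are missing or that went backwards
        # (for example after a server restart).
        if old is None or new < old:
            continue
        rates[name] = (new - old) / interval_seconds
    return rates


# Two hypothetical snapshots taken 20 seconds apart.
snapshot_a = {"Com_select": 1000, "Com_insert": 200, "Connections": 50}
snapshot_b = {"Com_select": 1400, "Com_insert": 260, "Connections": 10}
rates = counter_rates(snapshot_a, snapshot_b, 20)
```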
  • MySQL Collectd Plugin
    Uses MySQL SHOW STATUS command to collect statistics
  • mysql-slow.log & EXPLAIN command
    # Time: 101006 23:30:11
    # User@Host: prod[prod] @ domU-12-31-39-0F-D0-C1.compute-1.internal [10.193.211.47]
    # Query_time: 7 Lock_time: 0 Rows_sent: 1 Rows_examined: 19785
    SELECT * FROM `ec2_elastic_ips` WHERE (`ec2_elastic_ips`.ec2_instance_id = 6810144) LIMIT 1;
    mysql> explain select * FROM `ec2_elastic_ips` WHERE (`ec2_elastic_ips`.ec2_instance_id = 6810144) LIMIT 1 \G
    *************************** 1. row ***************************
    id: 1
    select_type: SIMPLE
    table: ec2_elastic_ips
    type: ALL
    possible_keys: NULL
    key: NULL
    key_len: NULL
    ref: NULL
    rows: 33332
    Extra: Using where
    1 row in set (0.00 sec)
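The slow-log entry above examined 19785 rows to return a single row, which together with `type: ALL` and `key: NULL` in the EXPLAIN output points to a full table scan. A small parser like the following sketch could surface such entries automatically; the function names and the ratio heuristic are assumptions for illustration, and the regular expression only covers the header format shown on this slide.

```python
import re

# Matches the "# Query_time: ..." header line of a slow-log entry.
HEADER = re.compile(
    r"# Query_time: (?P<query_time>[\d.]+)\s+"
    r"Lock_time: (?P<lock_time>[\d.]+)\s+"
    r"Rows_sent: (?P<rows_sent>\d+)\s+"
    r"Rows_examined: (?P<rows_examined>\d+)"
)


def parse_slow_header(line):
    """Parse one slow-log header line into a dict, or return None."""
    match = HEADER.search(line)
    if match is None:
        return None
    return {
        "query_time": float(match["query_time"]),
        "lock_time": float(match["lock_time"]),
        "rows_sent": int(match["rows_sent"]),
        "rows_examined": int(match["rows_examined"]),
    }


def examined_per_row_sent(entry):
    """Rows examined per row returned; large values suggest a missing index."""
    return entry["rows_examined"] / max(entry["rows_sent"], 1)


entry = parse_slow_header(
    "# Query_time: 7 Lock_time: 0 Rows_sent: 1 Rows_examined: 19785")
ratio = examined_per_row_sent(entry)
```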
  • MySQL performance depends on locality
    Wait time should be minimal when working set fits in memory
    Performance degrades once wait time is significant
    wait time insignificant
    user time dominates
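One way to turn this observation into a check is to compare the share of CPU time spent in I/O wait against a limit. The sketch below does that for a single CPU sample; the state names follow the usual collectd CPU plugin states, while the function names and the 20% limit are arbitrary assumptions for illustration.

```python
def cpu_shares(sample):
    """Convert raw per-state CPU time into fractions of the total."""
    total = sum(sample.values())
    if total == 0:
        return {}
    return {state: value / total for state, value in sample.items()}


def io_bound(sample, wait_limit=0.2):
    """True when the I/O wait share reaches `wait_limit` (assumed 20%)."""
    return cpu_shares(sample).get("wait", 0.0) >= wait_limit


# Working set fits in memory: user time dominates, wait is insignificant.
healthy = {"user": 70, "system": 10, "wait": 5, "idle": 15}
# Working set no longer fits: the server spends its time waiting on disk.
degraded = {"user": 20, "system": 5, "wait": 50, "idle": 25}
```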
  • MySQL reads graphs
    Read-random-next represents a table scan
    Read-next represents an index scan
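The two graph series correspond to MySQL's `Handler_read_rnd_next` and `Handler_read_next` status counters. A rough indicator of missing indices is the share of row reads that come from table scans, as in the sketch below. The function name is hypothetical, and treating the key, next, prev, and first counters together as "index reads" is a simplification made for this example.

```python
INDEX_COUNTERS = (
    "Handler_read_key",
    "Handler_read_next",
    "Handler_read_prev",
    "Handler_read_first",
)


def table_scan_share(status):
    """Fraction of row reads that came from table scans.

    `status` maps SHOW STATUS variable names to counter values
    (ideally deltas over a recent interval rather than lifetime totals).
    """
    scan_reads = status.get("Handler_read_rnd_next", 0)
    index_reads = sum(status.get(name, 0) for name in INDEX_COUNTERS)
    total = scan_reads + index_reads
    return scan_reads / total if total else 0.0


# Hypothetical counter deltas dominated by table scans.
share = table_scan_share({
    "Handler_read_rnd_next": 900,
    "Handler_read_next": 50,
    "Handler_read_key": 50,
})
```

A consistently high share is a hint to check the slow log and run EXPLAIN on the offending queries, not proof of a problem on its own, since small tables are often cheaper to scan.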
  • Misc: load testing using httperf
    RightScale provides ServerTemplates in the marketplace
    https://my.rightscale.com/library/server_templates/Httperf-Load-Tester-11H1/18316
    Tutorial on httperf setup and configuration
    http://support.rightscale.com/03-Tutorials/02-AWS/E2E_Examples/E2E_Gaming_Deployment/Adding_Httperf_Load_Tester
  • Questions?
    Rafael Saavedra - VP Engineering, RightScale
    June 8th, 2011