What does performance mean in the cloud?
What are the risks of moving to the cloud?




IDC Survey, Q4 '09 – "The Maturing Cloud: What It Will Take to Win" (published March 2010)
What are the major risks in the Cloud?

• Security – 87.5%
• Availability – 83.3%
• Performance – 82.9%

(88.6% stated that cloud service providers need to provide SLAs)

Results from actual pilots (March 2010):

Perception   Primary Benefits    Biggest Issues
Before       Reduced IT costs    Security
After        Scalability         Performance
             Agility             SLA Management

"All About The Cloud" Conference (May 2010):
"Security in the Cloud isn't any harder than it is in the Enterprise – it's just different" (Unisys)
"[Application] Performance Management in the Cloud is becoming the hot topic" (THINKstrategies)


   Projects fail to deliver acceptable performance
   Moving Legacy Applications is harder than thought
What is Performance?
Performance ≠ Scalability




        The Cloud scales, but does it perform?
How do we measure Performance?


• Response Time
   • Transaction-level metric
   • Don't use averages → high volatility
   • Be specific → which type of transaction?
• Throughput
   • Volume of transactions per timeframe
   • Average speed of transactions
   • Be specific → which type of transactions?
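Both metrics can be sketched in a few lines. The following is an illustrative Python sketch (not from the talk): it groups raw samples by transaction type and reports percentiles instead of averages, since a single slow outlier drags an average far away from what most users actually saw.

```python
import math
from collections import defaultdict

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least
    pct% of the samples at or below it."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(pct / 100.0 * len(ordered)) - 1)
    return ordered[rank]

def summarize(measurements):
    """Group (transaction_type, response_time_ms) samples and
    report count, median, and 90th percentile per type."""
    by_type = defaultdict(list)
    for txn_type, rt in measurements:
        by_type[txn_type].append(rt)
    return {
        t: {
            "count": len(s),
            "p50_ms": percentile(s, 50),
            "p90_ms": percentile(s, 90),
        }
        for t, s in by_type.items()
    }
```

For samples of 100, 110, 120, 130 and 2000 ms the average is 492 ms, yet four out of five users saw 130 ms or less – which is exactly why the slide warns against averages.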
What does Scalability mean?


• More concurrent transactions with the same response time
• Linearly growing throughput with linearly added hardware




          Scalability depends on Performance
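The "linear throughput with linear hardware" criterion can be expressed as a simple efficiency ratio. This is an illustrative sketch, not from the talk:

```python
def scaling_efficiency(base_nodes, base_throughput, new_nodes, new_throughput):
    """Ratio of observed to ideal (perfectly linear) throughput gain.
    1.0 means linear scaling; values well below 1.0 indicate a
    bottleneck in the application or in shared cloud services."""
    ideal = base_throughput * (new_nodes / base_nodes)
    return new_throughput / ideal
```

An efficiency of 0.9 when doubling from 2 to 4 nodes means you got 90% of the ideal throughput gain; a ratio that keeps falling as nodes are added is the "it scales, but not very well" case from the speaker notes.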
Performance in the cloud


• "Pure Performance" is never better in a Cloud!
   • Co-tenancy
   • Resource sharing
   • Commodity and generally smaller hardware
• Scalability can be better in the Cloud
   • Rapid elasticity
   • Depends on Application Design and Performance
   • Legacy Applications have limitations
• End User Performance depends on both, and more
   • Web Delivery Chain
   • Network!
   • Can be better than on premise!
Performance Management in the Cloud
Traditional Performance Management - Fails


• Sniffing and other appliances do not work
• Are based on system metrics, which are
   • Corrupted
   • Do not answer application performance questions
• Are not manageable
   • Too many unrelated metrics
   • Do not deal well with the exponential increase in complexity
Why is Cloud Monitoring not enough?


• Only System and High-Level Response Metrics

• No Visibility into the Application
  (Regressions, MTTR, Application Dependencies)

• No Visibility into End User Impact → Business Impact



  We need Application Focus
What we really care about

[Diagram: Availability and Baseline Performance measured at the end user (Web 2.0 clients), plus Detailed Contribution Times across the delivery chain – Load Balancer → Web Server → Frontend(s) → Backend(s) → Private Datacenter.]
Key Challenge - Volatility




Performance ≠ F(Capacity)

[Chart: "Real vs. Measured" utilization over time (0–60%) – measured utilization does not track real performance.]
Measure Performance where it matters




Faster is not better – but slow is bad.
Understand your Transactions
End To End: Don't forget the Chain




[Diagram: a User Click traced end to end – in the Cloud, on the Web Server, and in the Application.]
Details, Details, Details, but be aware…

High volatility comes from:
• Steal Time
• Shared I/O
• Shared network

→ Use virtualization-aware timers.
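On Linux guests, steal time is reported as the eighth counter on the aggregate `cpu` line of `/proc/stat`. A minimal sketch (the helper function is my own, not from the talk) of computing what fraction of CPU time the hypervisor took away:

```python
def cpu_steal_fraction(stat_line):
    """Parse an aggregate 'cpu' line in /proc/stat format and
    return the fraction of CPU time stolen by the hypervisor.

    Field order after the 'cpu' label:
    user nice system idle iowait irq softirq steal guest guest_nice
    """
    fields = [int(x) for x in stat_line.split()[1:]]
    total = sum(fields)
    steal = fields[7] if len(fields) > 7 else 0
    return steal / total if total else 0.0
```

On a live system you would read `/proc/stat` twice and diff the counters, since they are cumulative since boot; a single reading only gives the long-run average.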
Monitoring the Complex
Cloud Designs are simple, yet…


• Everything Fails!
• Tightly Couple End User Delivery Components
   • Few Tiers
   • Response Time
   • Scale Upfront

[Diagram: delivery chain serving 100,000s of users]
Cloud Designs are simple, yet…


• Everything Fails!
• Tightly Couple End User Delivery Components
   • Few Tiers
   • Response Time
   • Scale Upfront
• Loosely Couple everything else
   • Throughput
   • Scale everything independently

Simple Designs still lead to Complex Systems
Complex Systems are hard to manage
Monitoring Complex Systems – Look at what matters
Context matters


• Too much Aggregation will blur the picture




[Diagram: separate business transactions – Buying Books, Buying DVDs, Buying Clothes – each with its own baseline. Context matters!]
Measure what Matters


• The Application and its Business Transactions
• Measure End User Performance
• Measure Throughput on Transaction Type Level
• How Performance affects your business
   • e.g. Conversion Rate
   • SLA Window
   • Cost vs. Gain
• Prioritize based on what matters most
Identify cause of End User Impact




[Diagram: the flow of a single transaction, with its response time hotspots highlighted.]
Cloud vs. Application




[Example: Cloud monitoring would show CPU as the cause; the application view shows otherwise.]
Application or Cloud Instance?



[Diagram: Application hotspots – CPU, Wait, I/O, Sync, Suspension? What is the cause for the volatility?]
Putting Cloud Monitoring in Context

[Diagram: Steal time, or out of CPU? Cloud metrics in application context reveal the cause for latency.]
We want to scale the Application and not the Cloud


• Auto-scaling on system metrics
   • Is indirect and not goal-oriented
   • Fails when the application changes

• Scale on application metrics and application components
   • Transaction Load
   • Response Time Contribution and Trend
   • Throughput Goals
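A toy autoscaling rule along these lines might look as follows. All metric names, thresholds, and the scale-in condition are illustrative assumptions, not from the talk:

```python
def scale_decision(p90_ms, target_p90_ms, throughput, target_throughput,
                   trend_up, current_instances, max_instances):
    """Scale-out/scale-in driven by application metrics rather than CPU:
    add an instance when the response-time goal is missed and trending
    worse, or when throughput falls short of its goal; remove one when
    there is comfortable headroom on both. Returns the new instance count."""
    if p90_ms > target_p90_ms and trend_up and current_instances < max_instances:
        return current_instances + 1          # response-time goal missed
    if throughput < target_throughput and current_instances < max_instances:
        return current_instances + 1          # throughput goal missed
    if (p90_ms < 0.5 * target_p90_ms and throughput > 1.2 * target_throughput
            and current_instances > 1):
        return current_instances - 1          # ample headroom, save cost
    return current_instances
```

The point of the sketch is the inputs: transaction-level response time and throughput against explicit goals, not host CPU utilization, so the rule keeps working when the application changes.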
Rapid Deployment and Availability
Understand your Flow


• Understand the Application Flow
• Always capture performance data
   • Everything is transitory
   • Reproducing problems is hard
   • Analyze offline
• Identify Contributors
Automatically detect Regressions


• Deploy
• Compare
• Fix small
• Start again
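The deploy–compare step can be sketched as a per-transaction-type comparison against the pre-deploy baseline. This is an illustrative sketch; the 15% tolerance is an assumed threshold, not from the talk:

```python
def detect_regressions(baseline_p90, current_p90, tolerance=0.15):
    """Flag transaction types whose 90th-percentile response time
    degraded by more than `tolerance` relative to the pre-deploy
    baseline. Both arguments map transaction type -> p90 in ms.
    Returns {type: (baseline, current)} for each regression."""
    regressions = {}
    for txn, base in baseline_p90.items():
        cur = current_p90.get(txn)
        if cur is not None and cur > base * (1 + tolerance):
            regressions[txn] = (base, cur)
    return regressions
```

Run after every deployment: an empty result means no regression worse than the tolerance; a non-empty one tells you which transaction type to fix before the next "fix small, start again" cycle.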
Reacting Automatically to Issues


• Disk Latency Degradation
• Too much Steal Time
• Hardware Issues

• Detect "Application" Degradation

→ Terminate! And start a new instance.
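A sketch of such a rule follows. The metric names and thresholds are illustrative assumptions; actually terminating and replacing an instance would go through your cloud provider's API, which is omitted here:

```python
def pick_instances_to_replace(instances, max_steal=0.10, max_disk_latency_ms=20):
    """Given per-instance health samples, return the IDs of instances
    that should be terminated and replaced rather than debugged in
    place: high steal time and degraded disk latency are environment
    problems a restart on fresh hardware can fix."""
    doomed = []
    for inst in instances:
        if inst["steal_fraction"] > max_steal:
            doomed.append(inst["id"])        # starved by co-tenants
        elif inst["disk_latency_ms"] > max_disk_latency_ms:
            doomed.append(inst["id"])        # shared-I/O degradation
    return doomed
```

The design choice matches the slide: in a cloud you do not troubleshoot a bad instance, you replace it – provided your monitoring can tell an environment problem from an application one.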
Make sure you are not blind


• Application monitoring must be highly available
   • Outside and inside
   • Failover
   • Not in the same zone
• Automated Deployments
• Zero-Configuration Monitoring
Assessing Performance/Value
What is the goal?


• Performance and Scalability are not self-serving goals
• Aim for the "desired" End User Experience
   • Faster than that is not better
   • Using fewer resources is cheaper!
A Price Performance Index


• Put a dollar value on acceptable performance:
   • 90th-percentile response time / (Total Cost / Number of Transactions)
   • Desired Throughput / Total Cost
   • Mind volatility
   • The resulting Price Performance Index is comparable across environments
• Cost Scalability
   • Cost per Transaction must remain stable


               Performance is not based on Capacity

                    It is a function of desired User
                  Experience and associated Cost
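The two index formulas from the slide translate directly into code. This is a minimal sketch; the units (seconds for response time, dollars for cost) are assumptions:

```python
def cost_per_transaction(total_cost, num_transactions):
    """Dollars spent per transaction served."""
    return total_cost / num_transactions

def price_performance_index(p90_response_s, total_cost, num_transactions):
    """First slide formula: 90th-percentile response time divided by
    cost per transaction. Computed for the same workload, the index is
    comparable across cloud vendors and against on-premise."""
    return p90_response_s / cost_per_transaction(total_cost, num_transactions)

def throughput_per_dollar(desired_throughput, total_cost):
    """Second slide formula: desired throughput divided by total cost."""
    return desired_throughput / total_cost
```

To check the cost-scalability point, track `cost_per_transaction` as load grows: if it stays flat while you add instances, cost scales; if it creeps up, you are paying more per transaction for the same experience.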
Questions




 Michael Kopp
 Michael.kopp@dynaTrace.com
 http://blog.dynatrace.com
 @mikopp


Editor's Notes

  • #2 Surveys find (http://callcenterinfo.tmcnet.com/analysis/articles/149923-survey-finds-cloud-application-performance-concern-delaying-adoption.htm) that performance concerns about the cloud are rising, and that cloud adoption is delayed due to perceived and measured bad performance. Looking closer, the real problem is not the bad performance itself, but that it is not understood what to do in such a case. Cloud provider SLAs are purely on availability metrics, and there mostly on the availability of their APIs, not of the instances themselves. There are no SLAs on actually provided capacity, nor reports on actually consumed capacity. To make matters worse, due to the technology itself, traditional APM tools fail to deliver these metrics, so the cloud customer is left in the lurch. Is it the Cloud or is it the Application? Or both? Or neither? So the first thing we need in order to solve the cloud performance concern is the ability to measure our application and identify the root cause of performance issues, be it the cloud, a third-party service, the application itself, or something further up the delivery chain. That brings up a far more important question: what does performance mean? Here it can be said that the term performance does not actually change in the cloud. If we define performance as pure speed, then it is independent of the cloud; it does not matter how many instances we have. Speed is defined by the response time of a single transaction under defined circumstances. To keep things simple, let's define performance as the speed of a single transaction when nothing else is going on. Raw speed can be impacted by cloud hardware, services, and everything else. While we can measure that by looking at things like node response time, the only way to analyze it is to get visibility into the transaction. Then we see whether the application is slow, squandering resources, waiting for resources, or simply not getting enough CPU.
This can now be compared with speed on premise in a similar distributed setup; a comparison will show the differences. And while we can never analyze cloud issues on premise, we can understand where the cloud has impact in comparison to on premise, and we can identify these issues even without comparing. Now about scalability: this is the main case for the cloud. Scalability defines how many parallel transactions can be served without degradation of response time – or, for batch or transaction processing, how throughput increases when adding another node. If "performance" goes down under load, we scale up; if performance is then satisfactory again, we say it scales. If performance goes down although we add resources, it does not scale. Or if we need three times the resources for twice the load, it might scale, but not very well. The important thing to understand is that these kinds of scalability issues can again lie in the application or in the cloud. Only here it will most likely not be a matter of CPU or disk; the most likely congestion will happen in cloud services and the network. And again we see why the currently offered cloud monitoring is not enough. While we might see a service slow down under load, we will not see whether it is uniformly slower or only slower for certain requests, so we do not see whether the load is really the problem. The same is true for the network. Of course, for the application itself it is even worse if we cannot look inside. So in order to solve this we must again look inside the application. What's more, we need to understand what the application is doing, which transactions are doing what, and how they might affect each other. In reality it is not so different from an on-premise installation,
just with many more moving parts. With proper tools, however, we can master this challenge. Once we can measure, understand, and diagnose our applications in the cloud, we can finally understand what performance means there – or more precisely, how the performance and scalability of our application differ there. We can now define what performance in the cloud means: Response Time/$ or Throughput/$. In this scenario the response time or throughput is something you define and measure. Once you achieve it, performance in the cloud of your choice is not a "concern". More importantly, this kind of price performance index allows you to compare not only cloud against on premise – it allows you to compare cloud vendors against each other!
  • #4 A common misconception is that scalability takes care of performance. That is not true. Performance is about the speed of a single transaction, or throughput at a given size. Scalability is about getting the same speed with more transactions and more nodes – about doubling throughput when doubling the size. This actually means that an application needs to perform in order to scale!
  • #5 A common misconception is that scalability takes care of performance. That is not true. Performance is about the speed of a single transaction, or throughput at a given size. Scalability is about getting the same speed with more transactions and more nodes – about doubling throughput when doubling the size. This actually means that an application needs to perform in order to scale! First, a cloud built upon sharing resources can never perform better than a dedicated environment. But that is not even the question. The real question is…
  • #8 End User Performance equals Pure Performance + Scalability
  • #11 Profilers will not work, cloud monitoring is not application monitoring. Application monitoring in its traditional sense only tells us when something is slow but not why. This is important because we cannot replicate it in a normal environment and we need to understand it fast, because tomorrow we will deploy again, new changes will make analysis all the harder and might add new problems. On the other hand if we find it fast, we have the chance of fixing and improving tomorrow without changing our schedule.
  • #13 As we have seen, even the real utilization cannot tell us about performance. Time is relative. Utilization in the guest is useless. Utilization on the host does not allow us to infer performance. Thresholds cannot be managed. Performance cannot be inferred from resource usage.
  • #14 This can and should be measured outside the cloud. We can do this via synthetic transaction monitoring, which gives us a good feel for the baseline performance and for overall degradations. Of course we need to be sure to do this from the most important locations in the world, to take backbones into account. Another way of doing this, even closer to the user, is called RUM or UEM: it measures the response time directly from the customer's browser via injected JavaScript agents.
  • #17 PurePath
  • #21 If you don’t see anything here, then you really don’t care about it.
  • #24 One General and one Detail Transaction Flow with Database Impact. About Business Transactions
  • #26 CPU usage on the Web Server is the cause for the volatility here. This is real usage, not a percentage, which means it is really an application issue. If on the other hand we saw wait or I/O growing, then it might well be virtualization that is the cause for the volatility. This is of course only a high-level picture, but I think you get the idea.
  • #28 Scalability comes before performance in the cloud. Or to be more specific: scalability trumps resource usage. We used to make a tradeoff between scalability and resource usage like CPU, memory, or disk. That does not hold true in a cloud – we have CPU, memory, and disk. The things that remain limiting factors are network and database, and those need to be taken care of in the design. We can remove sync points in the database with NoSQL and data denormalization. We can take care of the network by using multiple zones, multiple clouds, and CDNs to some degree, but to a larger degree bandwidth needs to be addressed in the design. All of that makes our application more scalable; the downside is that it makes single transactions harder to understand, harder to monitor, and harder to analyze. And of course, once we have an application, finding scalability issues is not easy, and cloud sizing makes it all the harder.