Performance Monitoring in the Cloud - Gluecon 2011

Talk at GlueCon 2011 on Performance Monitoring and the Cloud

Topics
What is performance monitoring?
How does the cloud change things?
What should developers do?
The ideal operations dashboard

Published in: Technology, Business
Performance Monitoring in the Cloud - Gluecon 2011

  1. 1. Performance Monitoring in the Cloud - Paul Guth, Technical Operations
  3. 3. Agenda
      • Performance and Monitoring and Performance Monitoring
      • How THE CLOUD changes things
      • What you should do (you = cloud developers)
      • What I’d like to see
      We are NOT going to talk about using the cloud to do performance testing
      Gluecon - 2011 Cloudscaling - Paul Guth
  4. 4. What
  8. 8. What is Performance?
      Numbers
      • speed - rate (184.7 mph)
      • time per unit work (0-60 in 4.1s, 3:04.0min lightning lap)
      • efficiency (23 mpg)
      • stability (1.00g skidpad)
      • internals (550hp, 510 lb-ft)
      • throughput (4 seats, 13.4 cu ft trunk)
      Numbers aren’t everything
      • RWD, live rear axle, 56/44 f/r, airbags, LATCH, ABS - also matter
  11. 11. What is Monitoring?
      Observing system state through measurements (metrics)
      Why? There’s more than one purpose.
      • Detect problems for immediate action
          • Oh noes! Response time just doubled!
      • Provide data for diagnosing problems
          • WTH changed in the last ten minutes?
      • Inform decisions for long-term action
          • What is the current constraint on total throughput?
          • Forecasts for demand are Y by Christmas
  20. 20. What Do We Do In Old IT?
      Immediate actions triggered by performance monitoring include:
      • Activate standby capacity
      • Turn off features
      • Throttle incoming demand
      • “Fix something”
      Longer term actions include:
      • Deploying new infrastructure (or removing unneeded)
      • “Design something”
  21. 21. What Do We Do In Old IT? Common Theme: Capacity
  22. 22. The Cloud
  27. 27. Something Cloudy This Way Comes
      Traditional IT                          The Cloud
      Adding capacity is:                     Adding capacity is:
      • Expensive                             • Low marginal cost
      • Has a long lead time                  • Quick
      • Non-reversible                        • Reversible
      • Cheaper when done in large batches    • Same price in small batches
      • Requires capex outlay                 • All opex
      What this means is you can add capacity as an immediate activity, when it used to be a long-term activity.
      In fact, you can automate it.
  34. 34. Paradise - The End?
      Wait, all is not perfect
      • Sometimes adding capacity is not the right answer
          • Some problems autoscale to infinity
          • Fixing efficiency may be required
      • Adding capacity is cheap but not free
          • Spin up 10k new instances in a day and your controller will want an explanation
          • Transparently balance cost vs benefit
      • You still need some buffer
      • Your automation needs limits
          • Just Say No to Skynet
      “Auto-scale me if you want to live!”
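
The points on this slide (limits, buffer, cost) can be sketched as a single scaling decision. This is only an illustration under stated assumptions, not anything shown in the talk: the name `plan_instances`, the proportional scaling rule, and every number in `ScalingPolicy` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical limits; every value is an assumption for illustration."""
    target_ms: float = 200.0          # response time we want to stay under
    min_instances: int = 2            # "you still need some buffer"
    max_instances: int = 20           # hard cap: "your automation needs limits"
    max_step: int = 4                 # never add more than this per decision
    cost_cents_per_hour: int = 10     # assumed price of one instance-hour
    budget_cents_per_hour: int = 150  # spend ceiling agreed with finance


def plan_instances(current: int, response_ms: float,
                   policy: ScalingPolicy = ScalingPolicy()) -> int:
    """Return how many instances to run next, clamped by every limit."""
    # Naive proportional rule: twice the target latency asks for twice the capacity.
    wanted = math.ceil(current * response_ms / policy.target_ms)
    # Limit how fast automation can grow the cluster in a single step.
    wanted = min(wanted, current + policy.max_step)
    # Cheap is not free: never exceed the hard cap or the hourly budget.
    affordable = policy.budget_cents_per_hour // policy.cost_cents_per_hour
    wanted = min(wanted, policy.max_instances, affordable)
    # Keep a buffer even when the service looks idle.
    return max(wanted, policy.min_instances)
```

With these assumed defaults, `plan_instances(4, 400.0)` asks for 8 instances, while `plan_instances(14, 2000.0)` is held at 15 by the budget rather than growing without bound. A problem that "autoscales to infinity" then shows up as a cluster pinned at its limit, which is a signal to look at efficiency instead.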
  35. 35. What’s a Cloud Developer to Do?
  41. 41. What Do You Do?
      What to measure?
      • iops, cpu util, memfree, queue latency? USELESS! (*)
      • First measure what your customers care about
          • What do they pay you for?
          • Response time, load time, functional correctness
      • Monitor services (from customer perspective), not servers
      • Monitor your cost as well (COGS) - costs more variable now
      (*) NOTE: Not actually useless
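
As a rough sketch of measuring what customers care about, the snippet below summarizes a batch of customer-facing requests by response time and functional correctness. The `Request` record, the `service_health` function, and the choice of a nearest-rank 95th percentile are assumptions made for illustration; the talk does not prescribe specific statistics.

```python
import math
from typing import Dict, NamedTuple, Sequence


class Request(NamedTuple):
    """One customer-facing request as seen from outside the service (hypothetical record)."""
    duration_ms: float
    ok: bool  # did the customer get a functionally correct answer?


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def service_health(requests: Sequence[Request]) -> Dict[str, float]:
    """Summarize what the customer experienced, not what the servers felt."""
    if not requests:
        raise ValueError("no requests observed")
    durations = [r.duration_ms for r in requests]
    return {
        "p95_ms": percentile(durations, 95),
        "success_ratio": sum(r.ok for r in requests) / len(requests),
    }
```

Nothing in this summary mentions CPU, memory, or IOPS; those become useful later, when diagnosing why one of these two numbers moved.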
  48. 48. Thought Experiment
      • 0300 Sunday
      • CPU Utilization on your cluster increases to 100%
      • External service monitoring shows no problems
      What do you do?
      • Go back to sleep!
      • Please investigate on Monday, you’re probably wasting money
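
The thought experiment amounts to a triage rule: page a human only when customers are affected, and turn everything else into work for business hours. A minimal sketch follows, assuming a hypothetical `triage` function and an arbitrary 90% CPU threshold that is not from the talk.

```python
from enum import Enum


class Action(Enum):
    PAGE = "page on-call now"
    TICKET = "investigate during business hours"
    NONE = "nothing to do"


def triage(customer_service_ok: bool, cpu_util_pct: float,
           cpu_ticket_threshold: float = 90.0) -> Action:
    """Decide how loudly to react; the threshold is an illustrative assumption."""
    # Customer-visible trouble is the only thing worth waking someone for.
    if not customer_service_ok:
        return Action.PAGE
    # Busy servers with happy customers: likely wasted money, not an emergency.
    if cpu_util_pct >= cpu_ticket_threshold:
        return Action.TICKET
    return Action.NONE
```

For the 0300 Sunday scenario, `triage(True, 100.0)` returns `Action.TICKET`: nobody is paged, and the CPU spike is waiting in the queue on Monday.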
  49. 49. Monitoring Hierarchy
      • Customer Services
      • Application Metrics
      • Server Metrics
  55. 55. Other Things to Do
      • Record religiously all data around resource calls
      • Put in monitoring/metrics from the start!
      • Make it trivial (for devs) to record metrics, and incentivize them
          • collectd, graphite, etc
          • Build it into your framework/platform of choice
      • Make sure this monitoring scales out automatically when new instances appear and is retained when instances disappear
          • As much of this monitoring as possible should be at the cluster, not instance level
      • Use much more care when figuring out what to alert about - start with just customer services
          • False positives can be killers
          • Use data for diagnosis first, then learn when to alert
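
One way to make recording a metric trivial is a one-line helper around Graphite's plaintext protocol, in which each sample is sent to Carbon as `path value timestamp` followed by a newline (port 2003 by default). The sketch below is an assumption about how such a helper might look: the names `graphite_line` and `send_metric` and the `cluster.service.metric` naming scheme are hypothetical. Keying the path on the cluster rather than the hostname is one way to keep the series alive when instances come and go.

```python
import re
import socket
import time
from typing import Optional


def _clean(part: str) -> str:
    """Replace characters that would split or break a Graphite path segment."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", part)


def graphite_line(cluster: str, service: str, metric: str,
                  value: float, timestamp: Optional[float] = None) -> str:
    """Format one sample in the Graphite plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    # Cluster-level path (an assumed naming scheme): no instance hostname in it.
    path = ".".join(_clean(p) for p in (cluster, service, metric))
    return f"{path} {value} {ts}\n"


def send_metric(line: str, host: str = "localhost", port: int = 2003) -> None:
    """Send one formatted line to a Carbon plaintext listener (host is an assumption)."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))
```

A developer would then write something like `send_metric(graphite_line("web-prod", "checkout", "response_ms", 182.5))` wherever a resource call completes.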
  62. 62. Other Other Things to Do
      • Treat dependencies as if they’re vitally important!
          • Have model. Have API. Use model API
          • Have tools to visualize the model
          • Leverage it in your other tools
      • Automate handling all anticipated conditions
      • Learn learn learn
      • Measure customer experience!
      • Test in production (learn from production at least)
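
"Have model. Have API." could be as small as a dependency graph with one query that other tools call. The class below is a hypothetical sketch, not a description of any tool mentioned in the talk; the component names in the usage note are made up.

```python
from collections import deque
from typing import Dict, Set


class DependencyModel:
    """A tiny in-memory dependency model with a query API (illustrative only)."""

    def __init__(self) -> None:
        # Maps a component to the components that depend on it directly.
        self._used_by: Dict[str, Set[str]] = {}

    def add_dependency(self, component: str, depends_on: str) -> None:
        """Record that `component` needs `depends_on` in order to work."""
        self._used_by.setdefault(depends_on, set()).add(component)

    def impacted_by(self, failed: str) -> Set[str]:
        """Return everything that depends on `failed`, directly or transitively."""
        seen: Set[str] = set()
        queue = deque([failed])
        while queue:
            node = queue.popleft()
            for parent in self._used_by.get(node, ()):
                if parent not in seen:  # also guards against dependency cycles
                    seen.add(parent)
                    queue.append(parent)
        return seen
```

After `add_dependency("checkout", "payments-api")` and `add_dependency("payments-api", "postgres")`, calling `impacted_by("postgres")` returns both `payments-api` and `checkout`. An alerting or dashboard tool can use that to tie a low-level failure to the customer service it threatens.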
  63. 63. What The World Needs Now
  65. 65. The Ideal Monitoring Dashboard
      OK
      $/sec (in): 1,000
      $/sec (out): 1,000
  72. 72. Dashboards
      • Actionable data!
          • Long lists of stuff that’s OK are useless
          • “Event consoles” are useless - I care about current state, not what happened five minutes ago
          • Too much data is worse than no data
      • At top-level, show only customer services
      • Drill-down to what you want
      • Filter easily to narrow in
      • Save the trees, have a search box
      • Add arbitrary time-series data to any chart - including changelogs, business metrics
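
The "actionable data" rule can be sketched as a top-level view that collapses to a single `OK` when every customer service is healthy and otherwise lists only the services needing attention. The function name `top_level_view` and the free-form status strings are assumptions for illustration.

```python
from typing import List, Mapping


def top_level_view(customer_services: Mapping[str, str]) -> List[str]:
    """Return the lines a top-level dashboard would show (hypothetical format)."""
    # Long lists of healthy things are noise, so keep only what is not OK.
    problems = sorted(name for name, status in customer_services.items()
                      if status != "OK")
    if not problems:
        return ["OK"]
    return [f"{name}: {customer_services[name]}" for name in problems]
```

Given `{"checkout": "OK", "search": "SLOW", "login": "OK"}`, the view is the single line `search: SLOW`. Drill-down, filtering, and search would start from that line rather than from a wall of green.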
  73. 73. One Size Fits One
      Use different interfaces for different purposes and/or different audiences
  74. 74. Summary
      • Performance Monitoring and Capacity Management are joined at the hip
      • The Cloud enables automated, immediate capacity fixes
      • The price of automation is eternal vigilance
      • Monitor your customer-facing services first
      • Make it so easy to collect metrics that you’ll have tons and tons of them
      • Magic dashboard make Paul happy!
  75. 75. Thank You! g < at > cloudscaling d0t com @pgutheb
