Application Performance Management “tightening up your backend” dan kuebrich dan@tracelytics.com
  • split this into 4 pages with more info
    so, it’s important, but I’d argue that it’s also particularly interesting, and only getting more so
  • there’s a lot -- too much. here’s a little bit
  • I took a class called making decisions...
    it’s about measuring, then optimizing
    can’t tell you how to act -- that’s too specific
    but can tell you how to decide how to act
  • todo: replace with helloapp?
  • Transcript of "bp"

    1. 1. Application Performance Management “tightening up your backend” dan kuebrich dan@tracelytics.com
    2. 2. speed: where is it?
    3. 3. speed: where is it? DNS | DNS, connection
    4. 4. speed: where is it? DNS | First HTTP Request | your boxes | DNS, connection | Fulfill HTTP Request (“Time to first byte”)
    5. 5. speed: where is it? DNS | First HTTP Request | your boxes | Subsequent HTTP Requests | DNS, connection | Fulfill HTTP Request (“Time to first byte”) | Download + render page contents (+js)
    6. 6. Speed
    7. 7. What’s taking so long? ...
    8. 8. What’s taking so long? ... Time to connect (3ms)
    9. 9. What’s taking so long? ... Time to connect (3ms) Time to first byte (1.61s)
    10. 10. What’s taking so long? 33% ... Time to connect (3ms) Time to first byte (1.61s)
    11. 11. What’s taking so long? ?
    12. 12. What is in that bar?
    13. 13. Why you care (performance)• Speed optimization
    14. 14. Why you care (performance)• Speed optimization • A lot on client side, but not all
    15. 15. Why you care (performance)• Speed optimization • A lot on client side, but not all• Troubleshooting • Service disruptions -- resolve ASAP
    16. 16. Why you care (performance)• Speed optimization • A lot on client side, but not all• Troubleshooting • Service disruptions -- resolve ASAP• Concurrency • How does it scale?
    17. 17. Why you care (performance)• Speed optimization • A lot on client side, but not all• Troubleshooting • Service disruptions -- resolve ASAP• Concurrency • How does it scale?• Money • The purple bar is expensive.
    18. 18. 1996
    19. 19. 2011
    20. 20. It’s all about tradeoffs good / evil
    21. 21. It’s all about tradeoffs good / evil risk / reward
    22. 22. It’s all about tradeoffs good / evil risk / reward fearlessness / sobriety
    23. 23. How to make decisions (ideally)1. Decide what to measure
    24. 24. How to make decisions (ideally)1. Decide what to measure2. Measure, examine
    25. 25. How to make decisions (ideally)1. Decide what to measure2. Measure, examine3. Act
    26. 26. How to make decisions (ideally)1. Decide what to measure2. Measure, examine3. Act4. Check
    27. 27. 1. What to measure• Depends on what you’re looking for • Bottlenecks -- db or app server • Outages -- blocking on services • Business metrics -- SLA reports, infrastructure utilization• Measure as much as possible (reasonable)
    28. 28. 1. What to measure• Depends on what you’re looking for • Bottlenecks -- db or app server • Outages -- blocking on services • Business metrics -- SLA reports, infrastructure utilization• Measure as much as possible (reasonable) • You’ll never have all the data you want
    29. 29. How to make decisions (ideally)
    30. 30. How to make decisions (ideally)1. Decide what to measure
    31. 31. How to make decisions (ideally)1. Decide what to measure2. Measure, examine
    32. 32. How to make decisions (ideally)1. Decide what to measure2. Measure, examine3. Act
    33. 33. How to make decisions (ideally)1. Decide what to measure2. Measure, examine3. Act4. Check
    34. 34. 1. What to measure• Depends on what you’re measuring • DB = i/o, slow query log, buffer cache • Server = fastcgi queue • App = cpu/network • Cache = ram, eviction, hits
    35. 35. 1. What to measure• Depends on what you’re measuring • DB = i/o, slow query log, buffer cache • Server = fastcgi queue • App = cpu/network • Cache = ram, eviction, hits• Tower of Babel?
    36. 36. 1. What to measure• Depends on what you’re measuring • DB = i/o, slow query log, buffer cache • Server = fastcgi queue • App = cpu/network • Cache = ram, eviction, hits• Tower of Babel?• Common language: latency
    37. 37. 1. What to measure• Depends on what you’re measuring • DB = i/o, slow query log, buffer cache • Server = fastcgi queue • App = cpu/network • Cache = ram, eviction, hits• Tower of Babel?• Common language: latency • “Profiling”
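One way to read "common language: latency" is to time each component with the same yardstick, whatever its native metrics are. The sketch below is only an illustration: the `timed` helper, the `latencies` store, and the sleep-based stand-ins for DB and cache work are all assumptions, not anything from the talk.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: component name -> list of latencies in seconds.
latencies = defaultdict(list)

@contextmanager
def timed(component):
    """Record wall-clock latency of the enclosed block under `component`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies[component].append(time.perf_counter() - start)

def handle_request():
    # Stand-ins for real work; the sleeps are placeholders.
    with timed("db"):
        time.sleep(0.02)
    with timed("cache"):
        time.sleep(0.001)
    with timed("app"):
        sum(range(10_000))

handle_request()
for component, samples in sorted(latencies.items()):
    print(f"{component}: {sum(samples) * 1000:.1f} ms over {len(samples)} call(s)")
```

Because every component reports in the same unit, the DB, cache, and app numbers can be compared side by side.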
    38. 38. 2. How to measure• Machine-level • Cpu, load, i/o, network• Component-level • Logs, instrumentation • New Relic, Query Analyzer• Request-level • Tracing
    39. 39. 2. Machine metrics• You have four basic resources • CPU • RAM • I/O • Network• Open-source: Ganglia, Munin, Zabbix, etc.• Commercial: CloudKick, AppFirst, Librato, etc...• Everybody uses some form of this • Facebook monitors over 5 million metrics with Ganglia
    40. 40. 2. Machine Metrics
    41. 41. 2. Machine metrics• Home run: • DB has high CPU wait • Requests are slow -- why?• Falling short: • Low CPU usage on app and DB • Low disk usage on DB • Requests are slow -- why?
    42. 42. 2. Component metrics• Very heterogeneous • Throughput metrics • Error conditions • Profiling data• Collect from: • Logs: tail -f, Splunk, Loggly, Hoptoad • Service calls: JMX • Profiling: xhprof, cProfile • Other: New Relic, Query Analyzers• Basically everybody does this too in some form
    43. 43. 2. Component metrics• Home run: • Low CPU usage on app and DB • Low disk usage on DB • App instrumentation shows time spent in service calls • fastcgi queue getting deep • Requests are slow -- why?
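The slides name cProfile as one source of profiling data. A minimal sketch of collecting it from Python's standard library follows; `something_slow` and `handle_request` are hypothetical workloads made up for the example.

```python
import cProfile
import io
import pstats

def something_slow():
    # Placeholder for an expensive call (hypothetical workload).
    return sum(i * i for i in range(200_000))

def handle_request():
    results = 0
    for _ in range(5):
        results += something_slow()
    return results

profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

# Sort by cumulative time and keep the top few entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
report = buffer.getvalue()
print(report)
```

The report shows where time goes inside one process. It does not show time spent waiting on other services, which is the gap the tracing slides address.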
    44. 44. 2. Looking for blame A B
    45. 45. 2. Looking for blame A B
    46. 46. 2. Looking for blame HELP! A B
    47. 47. 2. Finding blame A B
    48. 48. 2. Finding blame No, help ME! A B
    49. 49. 2. Finding blame No, help ME! A B
            results += 24;
            do this a lot:
            something_slow()
            return results;
    50. 50. 2. Tracing metrics• Profiling + flow-of-control• Causal organization
    51. 51. 2. Tracing metrics• Profiling + flow-of-control• Causal organization • Lamport’s “happens before”
    52. 52. 2. Tracing metrics• Profiling + flow-of-control• Causal organization • Lamport’s “happens before”• Who does this?
    53. 53. 2. Tracing metrics• Profiling + flow-of-control• Causal organization • Lamport’s “happens before”• Who does this? • In-house solutions • Google, Goldman Sachs, others?
    54. 54. 2. Tracing metrics• Profiling + flow-of-control• Causal organization • Lamport’s “happens before”• Who does this? • In-house solutions • Google, Goldman Sachs, others? • Open-source • X-Trace, Magpie
    55. 55. 2. Tracing metrics• Profiling + flow-of-control• Causal organization • Lamport’s “happens before”• Who does this? • In-house solutions • Google, Goldman Sachs, others? • Open-source • X-Trace, Magpie • Commercial availability
    56. 56. 2. Tracing metrics• Profiling + flow-of-control• Causal organization • Lamport’s “happens before”• Who does this? • In-house solutions • Google, Goldman Sachs, others? • Open-source • X-Trace, Magpie • Commercial availability • DynaTrace, Tracelytics
    57. 57. 2. Tracing Metrics
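"Causal organization" with Lamport's happens-before can be illustrated with logical clocks: each service keeps a counter, and a receiver jumps past the sender's timestamp. This is a generic sketch of that idea, not how any of the tools named above work; the `LamportClock` class, the `trace` list, and the two services A and B are assumptions for the example.

```python
class LamportClock:
    """Logical clock: a counter that only moves forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event or message send: advance the counter.
        self.time += 1
        return self.time

    def receive(self, sent_time):
        # Message receipt: jump past the sender's timestamp.
        self.time = max(self.time, sent_time) + 1
        return self.time

trace = []  # (timestamp, service, event) tuples collected for one request

def record(clock_value, service, event):
    trace.append((clock_value, service, event))

a, b = LamportClock(), LamportClock()

record(a.tick(), "A", "request received")
sent = a.tick()
record(sent, "A", "call B")
record(b.receive(sent), "B", "start work")
record(b.tick(), "B", "finish work")
reply = b.tick()
record(reply, "B", "reply to A")
record(a.receive(reply), "A", "render response")

# Sorting by timestamp gives an order consistent with cause and effect.
for ts, service, event in sorted(trace):
    print(ts, service, event)
```

Sorting by these timestamps never places an effect before its cause, even if the two machines' wall clocks disagree.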
    58. 58. How to make decisions (ideally)
    59. 59. How to make decisions (ideally)1. Decide what to measure
    60. 60. How to make decisions (ideally)1. Decide what to measure2. Measure, examine
    61. 61. How to make decisions (ideally)1. Decide what to measure2. Measure, examine3. Act
    62. 62. How to make decisions (ideally)1. Decide what to measure2. Measure, examine3. Act4. Check
    63. 63. 3. Act• You found your problem
    64. 64. 3. Act• You found your problem • If not, go back 20 slides and repeat...
    65. 65. 3. Act• You found your problem • If not, go back 20 slides and repeat...• Infrastructure upgrades
    66. 66. 3. Act• You found your problem • If not, go back 20 slides and repeat...• Infrastructure upgrades • More boxes, better boxes
    67. 67. 3. Act• You found your problem • If not, go back 20 slides and repeat...• Infrastructure upgrades • More boxes, better boxes• Redistribute work / resource scheduling
    68. 68. 3. Act• You found your problem • If not, go back 20 slides and repeat...• Infrastructure upgrades • More boxes, better boxes• Redistribute work / resource scheduling • Service-oriented architecture (SOA)
    69. 69. 3. Act• You found your problem • If not, go back 20 slides and repeat...• Infrastructure upgrades • More boxes, better boxes• Redistribute work / resource scheduling • Service-oriented architecture (SOA)• Do less work
    70. 70. 3. Act• You found your problem • If not, go back 20 slides and repeat...• Infrastructure upgrades • More boxes, better boxes• Redistribute work / resource scheduling • Service-oriented architecture (SOA)• Do less work • Skip what you can, cache what you can’t
    71. 71. 3. Act• You found your problem • If not, go back 20 slides and repeat...• Infrastructure upgrades • More boxes, better boxes• Redistribute work / resource scheduling • Service-oriented architecture (SOA)• Do less work • Skip what you can, cache what you can’t• Do work later
    72. 72. 3. Act• You found your problem • If not, go back 20 slides and repeat...• Infrastructure upgrades • More boxes, better boxes• Redistribute work / resource scheduling • Service-oriented architecture (SOA)• Do less work • Skip what you can, cache what you can’t• Do work later • Deferred processing
    73. 73. 3. Caching• Store things where they can be retrieved more cheaply (faster)
    74. 74. 3. C.R.E.A.M.
    75. 75. 3. C.R.E.A.M.• Browser cache
    76. 76. 3. C.R.E.A.M.• Browser cache• CDN
    77. 77. 3. C.R.E.A.M.• Browser cache• CDN• Proxy / optimizer
    78. 78. 3. C.R.E.A.M.• Browser cache• CDN• Proxy / optimizer• Opcode
    79. 79. 3. C.R.E.A.M.• Browser cache• CDN• Proxy / optimizer• Opcode• Application-driven
    80. 80. 3. C.R.E.A.M.• Browser cache• CDN• Proxy / optimizer• Opcode• Application-driven • App-specific cache
    81. 81. 3. C.R.E.A.M.• Browser cache• CDN• Proxy / optimizer• Opcode• Application-driven • App-specific cache • ORM cache
    82. 82. 3. C.R.E.A.M.• Browser cache• CDN• Proxy / optimizer• Opcode• Application-driven • App-specific cache • ORM cache • Local (runtime) cache
    83. 83. 3. C.R.E.A.M.• Browser cache• CDN• Proxy / optimizer• Opcode• Application-driven • App-specific cache • ORM cache • Local (runtime) cache• Database
    84. 84. 3. C.R.E.A.M.• Browser cache• CDN• Proxy / optimizer• Opcode• Application-driven • App-specific cache • ORM cache • Local (runtime) cache• Database • Query cache
    85. 85. 3. C.R.E.A.M.• Browser cache• CDN• Proxy / optimizer• Opcode• Application-driven • App-specific cache • ORM cache • Local (runtime) cache• Database • Query cache • Denormalization
    86. 86. 3. C.R.E.A.M.• Browser cache• CDN• Proxy / optimizer• Opcode cache• Application-driven • App-specific cache • ORM cache • Local (runtime) cache• Database • Query cache • Denormalization
    87. 87. 3. C.R.E.A.M.• Browser cache• CDN• Proxy / optimizer• Opcode cache• Application-driven • App-specific cache • ORM cache • Local (runtime) cache• Database • Query cache • Denormalization (annotation: top of the list = more speed gain, more invalidations; bottom of the list = less speed gain, fewer invalidations)
    88. 88. 3. When to cache• Protect resources • DB • Services• Cover for slow actions • DB • Disk hits • External service calls • Number-crunching
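As a rough illustration of "cover for slow actions", here is a small app-level cache with expiry and explicit invalidation. Everything in it is hypothetical: `TTLCache`, `slow_query`, and the 30-second TTL are chosen for the example, and a real deployment would more likely use something like memcached.

```python
import time

class TTLCache:
    """Tiny in-process cache with per-entry expiry (illustrative only)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        entry = self.store.get(key)
        now = time.monotonic()
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Missing or expired: do the slow work once and remember the result.
        self.misses += 1
        value = compute()
        self.store[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key):
        self.store.pop(key, None)

def slow_query():
    # Hypothetical stand-in for a DB hit or external service call.
    time.sleep(0.05)
    return {"rows": 42}

cache = TTLCache(ttl_seconds=30)
first = cache.get_or_compute("report", slow_query)   # miss: runs slow_query
second = cache.get_or_compute("report", slow_query)  # hit: served from memory
cache.invalidate("report")                           # underlying data changed
third = cache.get_or_compute("report", slow_query)   # miss again
print(cache.hits, cache.misses)
```

The hit and miss counters are worth keeping, since the earlier slides list "eviction, hits" among the things to measure for a cache.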
    89. 89. 3. Deferred work• Premise: synchronous work is lame • Go async!• Mechanism: queue • RabbitMQ, 0MQ, ActiveMQ, Amazon SQS Q app servers workers/hadoop/?? db/cache
    90. 90. 3. When to queue• Actions you can decouple from that page load • Things that don’t have to update in real-time • Counter updates (queue and aggregate) • External API calls • Long-running requests (ajax) • Batch processing • Shell commands
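The "counter updates (queue and aggregate)" case can be sketched with an in-process queue and one worker thread. This is an assumption-laden toy: Python's `queue.Queue` stands in for a real broker such as RabbitMQ or SQS, and `handle_page_view` is a made-up request handler.

```python
import queue
import threading
from collections import Counter

events = queue.Queue()  # stand-in for a real message broker
totals = Counter()      # aggregated counts, written only by the worker

def worker():
    while True:
        page = events.get()
        if page is None:  # sentinel: stop the worker
            events.task_done()
            break
        totals[page] += 1
        events.task_done()

thread = threading.Thread(target=worker, daemon=True)
thread.start()

def handle_page_view(page):
    # The page load only enqueues; the counting happens later.
    events.put(page)
    return f"rendered {page}"

for page in ["/home", "/about", "/home"]:
    handle_page_view(page)

events.put(None)
events.join()
thread.join()
print(dict(totals))
```

The request path pays only for the enqueue, and the worker can batch or aggregate updates before touching the database.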
    91. 91. 3. Redistribute work• Service-oriented architecture • Reusable components • Co-tenable components app
    92. 92. 3. SOA• We’ve got two pages on our website and one box serving it
            def fast_action():
                x *= y
                render('fast.tpl')
            def slow_action():
                x = compute()
                render('slow.tpl')
        • Problem? • Slow actions starve fast actions! • How to remedy?
    93. 93. 3. SOA• Take 1: buy more servers • But if anyone calls slow action on one, we lose • All servers must be able to handle slow_action’s workload• Take 2: pull out slow action
            def fast_action():
                x *= y
                render('fast.tpl')
            def slow_action():
                x = remote_compute()
                render('slow.tpl')
        • Who does this????
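The "pull out slow action" idea can be approximated in a single process by giving slow work its own pool, so that it cannot occupy the workers serving fast pages. This is only a sketch under that assumption: two thread pools stand in for separate services, and `compute`, `fast_action`, and `slow_action` here are simplified stand-ins for the functions on the slide.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def compute():
    # Hypothetical slow computation.
    time.sleep(0.3)
    return "slow result"

def fast_action():
    return "fast page"

def slow_action(pool):
    # Stand-in for remote_compute(): hand the work to a dedicated pool.
    return pool.submit(compute)

web_pool = ThreadPoolExecutor(max_workers=2)      # serves page requests
compute_pool = ThreadPoolExecutor(max_workers=2)  # stand-in for the separate service

# Saturate the compute service with slow work.
slow_futures = [slow_action(compute_pool) for _ in range(4)]

# Fast actions still run promptly because they use their own workers.
start = time.perf_counter()
fast_results = [web_pool.submit(fast_action).result() for _ in range(10)]
fast_elapsed = time.perf_counter() - start

slow_results = [future.result() for future in slow_futures]
web_pool.shutdown()
compute_pool.shutdown()
print(f"10 fast actions took {fast_elapsed * 1000:.1f} ms while slow work was queued")
```

With a single shared pool of two workers, the ten fast actions would have waited behind the four slow ones. Splitting the pools is the same tradeoff the slide describes, at a smaller scale.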
    94. 94. 3. Resource scheduling Low Low Low app Low High High Low Low Low memcached number-cruncher
    95. 95. 3. Resource scheduling Low Low Low app Low High High Low Low Low memcached number-cruncher
    96. 96. How to make decisions (ideally)
    97. 97. How to make decisions (ideally)1. Decide what to measure
    98. 98. How to make decisions (ideally)1. Decide what to measure2. Measure, examine
    99. 99. How to make decisions (ideally)1. Decide what to measure2. Measure, examine3. Act
    100. 100. How to make decisions (ideally)1. Decide what to measure2. Measure, examine3. Act4. Check
    101. 101. 4. Did we ruin everything?• If your metrics were right, things are probably faster • But they’re different • ... and probably more complicated• How do we keep track of it? • Better tools• Next month: performance and load testing with Selenium
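For the "check" step, one simple approach is to compare latency percentiles before and after the change instead of relying on averages. The sample numbers below are made up for illustration, and `summarize` is a hypothetical helper.

```python
import statistics

def summarize(samples_ms):
    """Return median and 95th-percentile latency for a list of samples."""
    # n=20 yields 19 cut points; the last one (index 18) is the 95th percentile.
    cuts = statistics.quantiles(samples_ms, n=20, method="inclusive")
    return {"p50": statistics.median(samples_ms), "p95": cuts[18]}

# Made-up latency samples in milliseconds, before and after a change.
before = [120, 135, 150, 160, 180, 200, 240, 300, 900, 1600]
after = [80, 85, 90, 95, 100, 110, 120, 130, 140, 1500]

b, a = summarize(before), summarize(after)
for key in ("p50", "p95"):
    print(f"{key}: {b[key]:.0f} ms -> {a[key]:.0f} ms")
```

In this made-up data the median improves a lot while the slowest request barely moves, which is the kind of "faster, but different" outcome the slide warns about.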
    102. 102. Takeaways• Hard to solve problems without understanding them at a fundamental level • Get data, visualize• Machine and component metrics are key • Sometimes they’re not enough• Once we know a problem, there’s help • SOA, Cache, Deferral -- complementary tools• As web systems become more complicated, we must use more sophisticated tools to monitor and debug them
    103. 103. Thanks! dan kuebrich dan@tracelytics.com
