Capacity Management Presentation

    Capacity Management Presentation - Presentation Transcript

    1. Capacity Management
      • for Web Operations
      John Allspaw Operations Engineering
    2. the book I’m writing
    3. ???
    4. Rules of Thumb, Planning/Forecasting, Stupid Capacity Tricks (with some Flickr statistics sprinkled in)
      Things that can cause downtime
      • bugs (disguised as capacity problems)
      • edge cases (disguised as capacity problems)
      • security incidents
      • real capacity problems*
      * (should be the last thing you need to worry about)
    5. Capacity != Performance
      • Forget about performance for right now
      • Measure what you have right NOW
      • Don’t count on it getting any better
    6. Thank You HPC Industry!
      • Automated Stuff
      • Scalable Metric Collection/Display
      a lot of great deployment and management tricks come from them, adopted by web ops
    7. Good Measurement Tools
      • record and store
      • metrics in/out
      • custom metrics
      • easily compare
      • lightweight-ish
    8. Clouds need planning too
      • Makes deployment and procurement easy and quick
      • But clouds are still resources with costs and limits, just like your own stuff
      • Black-boxes: you may need to pay even more attention than before
    9. Metrics
      • System Statistics
    10. Metrics
      • “Application” Level
      (photos processed per minute) (average processing time per photo) (apache requests) (concurrent busy apache procs)
    11. Metrics
      • App-level meets system-level
      here, total CPU = ~1.12 * # busy apache procs (ymmv)
    12. 2400 photos per minute being uploaded right NOW (Tuesday afternoon)
    13. Ceilings: the greatest amount of “work” your resources will allow before degradation or failure
    14. Forget Benchmarking
    15. Find your ceilings (chart labels: what you have left; The End)
    16. Use real live production data to find ceilings. Production: “it’s like a lab, but bigger!”
    17. Like: database ceilings. Replication lag: bad!
    18. Ceilings: waiting on disk too much. Sustained disk I/O wait of >40% creates slave lag* (*for us, YMMV)
    19. 35,000 photo requests per second on a Tuesday peak
    20. Safety Factors
    21. Safety Factors: Ceiling * Factor of Safety = UR LIMITZ
    22. Safety Factors: webserver!
    23. Safety Factors: “safe” ceiling @ 85% CPU. 85% total CPU = ~76 busy apache procs (chart label: what you have left)
    24. Safety Factors: Yahoo Front Page link to Chinese New Year photos (photo requests/second) (8% spike)
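The safety-factor arithmetic on these slides can be sketched in a few lines. This is only an illustration, assuming the roughly linear relationship quoted earlier (total CPU = ~1.12 * # busy apache procs) and an 85% factor of safety applied to a 100% CPU ceiling. The function and constant names are hypothetical, and the slope is Flickr's measurement, so YMMV.

```python
# Assumption: total CPU % grows linearly with busy Apache procs, using the
# ~1.12 slope quoted on the slides. Measure your own servers for a real value.
CPU_PERCENT_PER_BUSY_PROC = 1.12

def safe_limit(ceiling, factor_of_safety):
    """Ceiling * Factor of Safety = the limit you actually plan against."""
    return ceiling * factor_of_safety

def procs_at_cpu(cpu_percent, cpu_per_proc=CPU_PERCENT_PER_BUSY_PROC):
    """Busy-proc count at which one server reaches the given CPU percentage."""
    return cpu_percent / cpu_per_proc

def headroom(limit, current):
    """How much is left under the limit ("what you have left"), never negative."""
    return max(limit - current, 0)

safe_cpu = safe_limit(100.0, 0.85)                # the "safe" ceiling, ~85% CPU
per_server_limit = round(procs_at_cpu(safe_cpu))  # 85 / 1.12 is about 75.9
print(per_server_limit)
print(headroom(per_server_limit, 50))             # 50 is a made-up current peak
```

Keeping the factor of safety as an explicit parameter makes it easy to tighten or loosen the limit per resource type without touching the measured ceiling.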
    25. Forecasting
    26. Forecasting, fictional example: webservers
    27. Forecasting, fictional example: 15 webservers, 1 week (chart label: peak of the week)
    28. Forecasting: ...bigger sample, 6 weeks....isolate the peaks...
    29. Forecasting: ...”Add a Trendline” with some decent correlation... (chart labels: now; not too shabby)
    30. Forecasting: 15 servers @ 76 busy apache proc limit = 1140 total procs (chart labels: ceiling; what you have left; “when is this?”; “this will tell you when it is”)
    31. Forecasting: (1140-726) / 42.751 = 9.68 (week #10, duh)
      Forecasting Automation
      • Writing excel macros is boring
      • All we want is “days remaining”, so all we need is the curve-fit
      Use http://fityk.sf.net to automate the curve-fit
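The trendline arithmetic on slide 31 can be reproduced without a spreadsheet. The sketch below is not the author's tooling: it is a plain least-squares fit plus the "when do we hit the ceiling" division, with hypothetical function names. It assumes the slide's numbers describe a trendline with slope 42.751 and intercept 726, which is a reading of the slide's division rather than something it states outright.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def x_at_ceiling(slope, intercept, ceiling):
    """Solve slope * x + intercept = ceiling for x."""
    if slope <= 0:
        raise ValueError("no growth: the ceiling is never reached")
    return (ceiling - intercept) / slope

# Slide numbers: 15 servers * 76 busy procs each = 1140 total procs.
ceiling = 15 * 76
weeks = x_at_ceiling(42.751, 726, ceiling)
print(round(weeks, 2))
```

In practice `fit_line` would be fed the isolated weekly peaks (week number, peak busy procs), and the result re-computed as each new week of data arrives.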
    32. Forecasting, fictional example: storage consumption
    33. Forecasting Automation: actual flickr storage consumption from early 2005, in GB (ceiling is fictional) (chart labels: “this will tell you when this is”)
    34. Forecasting Automation: cmd line script output
      [jallspaw:~]$ cfityk ./fit-storage.fit
      1> # Fityk script. Fityk version: 0.8.2
      2> @0 < '/home/jallspaw/storage-consumption.xy'
      15 points. No explicit std. dev. Set as sqrt(y)
      3> guess Quadratic
      New function %_1 was created.
      4> fit
      Initial values: lambda=0.001 WSSR=464.564
      #1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
      #2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
      #3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
      #4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
      Fit converged.
      Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
      5> info formula in @0
      # storage-consumption
      14147.4+146.657*x+0.786854*x^2
      6> quit
      bye...
    35. Forecasting Automation (SAME)
      fityk gave: y = 0.786854x^2 + 146.657x + 14147.4 (R^2 = 99.84)
      Excel gave: y = 0.7675x^2 + 146.96x + 14147.3 (R^2 = 99.84)
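Once the curve-fit exists, "when do we hit the ceiling" is a quadratic solve. The sketch below uses the fityk coefficients from the slide. The ceiling value is purely hypothetical (the slides say their ceiling is fictional and do not give it), the function name is made up for illustration, and x is in whatever units the data file uses.

```python
import math

def x_at_ceiling_quadratic(a, b, c, ceiling):
    """Smallest non-negative x where a*x^2 + b*x + c equals ceiling."""
    if a == 0:
        raise ValueError("not a quadratic: use a linear solve instead")
    disc = b * b - 4 * a * (c - ceiling)
    if disc < 0:
        raise ValueError("the curve never reaches the ceiling")
    root = math.sqrt(disc)
    candidates = [(-b + root) / (2 * a), (-b - root) / (2 * a)]
    future = [x for x in candidates if x >= 0]
    if not future:
        raise ValueError("the ceiling was crossed before x = 0")
    return min(future)

# Coefficients from the fityk output: 14147.4 + 146.657*x + 0.786854*x^2
A, B, C = 0.786854, 146.657, 14147.4
HYPOTHETICAL_CEILING_GB = 20000  # assumption for illustration only

x = x_at_ceiling_quadratic(A, B, C, HYPOTHETICAL_CEILING_GB)
print(round(x, 1))
```

Subtracting the x of the most recent data point from the result gives the remaining runway in the same units.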
    36. Capacity Health
      • 12,629 nagios checks
      • 1314 hosts
      • 6 datacenters
      • 4 photo “farms”
      • farm = 2 DCs (east/west)
    37. High and Low Water Marks: per server, squid requests per second (alert if higher; alert if lower)
    38. A good dashboard looks something like... (yes, fictional numbers)
      type      #   limit/box  ceiling units  limit (total)  current (peak)  % peak  Est days left
      www       20  80         busy procs     1600           1000            62.50%  36
      shard db  20  40         I/O wait       800            220             27.50%  120
      squid     18  950        req/sec        17,100         11,400          66.67%  48
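The derived columns in this dashboard are simple arithmetic on the per-box limit. A minimal sketch, using the slide's fictional numbers and a hypothetical `dashboard_row` helper, follows. The "Est days left" column is deliberately left out, because it depends on a growth-rate forecast that the table does not show.

```python
def dashboard_row(node_count, limit_per_box, current_peak):
    """Return (total limit, percent of that limit used at the current peak)."""
    total_limit = node_count * limit_per_box
    percent_of_peak = round(100 * current_peak / total_limit, 2)
    return total_limit, percent_of_peak

# (node count, limit per box, current peak), taken from the fictional table.
rows = {
    "www": (20, 80, 1000),
    "shard db": (20, 40, 220),
    "squid": (18, 950, 11400),
}

for name, args in rows.items():
    total_limit, percent = dashboard_row(*args)
    print(f"{name:<9} {total_limit:>6} {percent:>6.2f}%")
```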
    39. Diagonal Scaling
      • Image processing machines
      • Replace Dell PE860s with HP DL140 G3s
      vertically scaling your already horizontal nodes
    40. Diagonal Scaling example: image processing, 4 cores to 8 cores (about the same CPU “usage” per box)
    41. Diagonal Scaling example: image processing throughput
      ~45 images/min @ peak to ~140 images/min @ peak (same CPU usage, but ~3x more work)
      “processing” means making 4 sizes from originals
    42. Diagonal Scaling example: image processing
      went from: 23 Dell PE860s, 1035 photos/min, 3008.4 Watts, 23U rack
      to: 8 HP DL140 G3s, 1120 photos/min (75% faster, even), 1036.8 Watts, 8U rack !!!
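The before/after numbers on slide 42 are easier to compare once they are normalized per box, per watt, and per rack unit. The sketch below does only that arithmetic on the slide's figures. The dictionary keys and the `efficiency` helper are hypothetical names chosen for illustration.

```python
def efficiency(fleet):
    """Normalize a fleet's throughput per box, per watt, and per rack unit."""
    return {
        "photos_per_min_per_box": fleet["photos_per_min"] / fleet["boxes"],
        "photos_per_min_per_watt": fleet["photos_per_min"] / fleet["watts"],
        "photos_per_min_per_rack_unit": fleet["photos_per_min"] / fleet["rack_units"],
    }

# Figures from the slide.
before = {"boxes": 23, "photos_per_min": 1035, "watts": 3008.4, "rack_units": 23}
after = {"boxes": 8, "photos_per_min": 1120, "watts": 1036.8, "rack_units": 8}

eff_before = efficiency(before)
eff_after = efficiency(after)
for key in eff_before:
    ratio = eff_after[key] / eff_before[key]
    print(f"{key}: {eff_before[key]:.2f} to {eff_after[key]:.2f} ({ratio:.1f}x)")
```

The per-box figures come out to the ~45 and ~140 images/min quoted on slide 41, which is a useful consistency check on the totals.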
    43. 3.52 terabytes will be consumed today (on a Tuesday)
    44. 2nd Order Effects (beware the wandering bottleneck): running hot, so add more
    45. 2nd Order Effects (beware the wandering bottleneck): running great now, so more traffic! Now these run hot
    46. Stupid Capacity Tricks
    47. Stupid Capacity Tricks: quick and dirty management
      DSH http://freshmeat.net/projects/dsh
      [root@netmon101 ~]# cat group.of.servers
      www100
      www118
      dbcontacts3
      admin1
      admin2
    48. Stupid Capacity Tricks: quick and dirty management
      [root@netmon101 ~]# dsh -N group.of.servers
      dsh> date
      executing 'date'
      www100: Mon Jun 23 14:14:53 UTC 2008
      www118: Mon Jun 23 14:14:53 UTC 2008
      dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
      admin1: Mon Jun 23 14:14:53 UTC 2008
      admin2: Mon Jun 23 14:14:53 UTC 2008
      dsh>
    49. Stupid Capacity Tricks Turn Stuff OFF
      • Disable heavy-ish features of the site (on/off switches)
        • We have 195 different things to disable in case of emergency.
    50. Stupid Capacity Tricks: Turn Stuff OFF
      • uploads (photo)
      • uploads (video)
      • uploads by email
      • various API things
      • various mobile things
      • various search things
      • etc., etc.
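The on/off switches described on these two slides can be as simple as a lookup consulted at the top of each heavy code path. The sketch below is a generic illustration, not how Flickr implements it: the `FeatureSwitches` class, the switch names, and the response strings are all hypothetical, and a real system would keep the state somewhere every server can read it.

```python
class FeatureSwitches:
    """In-memory registry of features that can be turned off in an emergency."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}

    def _check(self, name):
        if name not in self._enabled:
            raise KeyError(f"unknown feature switch: {name}")

    def disable(self, name):
        self._check(name)
        self._enabled[name] = False

    def enable(self, name):
        self._check(name)
        self._enabled[name] = True

    def is_enabled(self, name):
        self._check(name)
        return self._enabled[name]


def handle_video_upload(switches):
    """Refuse the expensive work early when the switch is off."""
    if not switches.is_enabled("video_uploads"):
        return "503: video uploads are temporarily disabled"
    return "200: upload accepted"


switches = FeatureSwitches(["photo_uploads", "video_uploads", "uploads_by_email"])
switches.disable("video_uploads")
print(handle_video_upload(switches))
```

Rejecting unknown switch names loudly matters here, because a typo in an emergency should fail immediately rather than silently leave the feature on.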
      Stupid Capacity Tricks: Outages Happen
      • Host your outage/status/blog page in more than one datacenter.
      • Tell your users WTF is going on, they’ll appreciate it.
    51. Stupid Capacity Tricks Hit the Pause Button
      • Bake the dynamic into static
      • Some Y! properties have a big red button to instantly bake (and un-bake) at will
    52. thanks
      http://flickr.com/photos/bondidwhat/402089763/
      http://flickr.com/photos/74876632@N00/2394833962/
      http://flickr.com/photos/42311564@N00/220394633/
      http://flickr.com/photos/unloveable/2422483859/
      http://flickr.com/photos/absolutwade/149702085/
      http://flickr.com/photos/krawiec/521836276/
      http://flickr.com/photos/eschipul/1560875648/
      http://flickr.com/photos/library_of_congress/2179060841/
      http://flickr.com/photos/jekkyl/511187885/
      http://flickr.com/photos/ab8wn/368021672/
      http://flickr.com/photos/jaxxon/165559708/
      http://flickr.com/photos/sparktography/75499095/
    53. We’re Hiring! flickr.com/jobs Come see me!
    54. questions?


    © All Rights Reserved
