Capacity Management from Flickr
Notes
  • Only two more chapters to go. :)
  • How many of you manage servers for your site? How many of you know how many servers you have? (databases, webservers, etc.) How many of you collect metrics for all of your capacity resources?
  • I’ll be repeating some concepts that I’ve talked about in other presentations on the same topic... 1. Planning 2. Manage 3. Stupid Catastrophe Tricks with some random statistics from Flickr sprinkled throughout
  • By “capacity problems” I mean NORMAL capacity trends, not spiking ones. By edge cases, I mean usage patterns that exist outside the realm of normal operation. Examples: users with 60,000 tags on 20 photos (not possible anymore)...search API calls with 60 ORs (not possible anymore)
  • The High Performance Computing industry has created a lot of tools and deployment philosophies that web operations can learn from.
  • It *DOESN’T* matter which tool you use, as long as it can satisfy these criteria.
  • Knowing what system resources mean in terms of application usage puts the whole capacity shebang into context. Another example would be: Max QPS for a MySQL server = X users, Y photos.
  • (that’s about 40 per second.)
  • Artificial stress testing is rarely good for testing real capacity ceilings. It’s great for comparing two different hardware platforms, tho.
  • How many of you know how many QPS your MySQL machines can do without degrading or failing? (slave lag, anyone?)
  • Find ceilings by measuring *real* data from production. WHY?? 1. Development “cycles” are TIGHT, so code changes, so load characteristics change all the time. (sometimes in big ways) 2. Edge cases get shown in production, not in my imagination. (60k tags on 100 photos?) 3. Too much time wasted on artificial test setups to get accuracy that doesn’t matter.
  • Sometimes you don’t have to increase load artificially, you bump up against the limits naturally...
  • So, our ceiling is disk I/O wait, and it’s around 30-40% that we want to stay under... WHY does this happen? I don’t know, and I don’t care, not right now....
  • Squid requests per second, at peak, on a Tuesday.
  • Structural and mechanical engineering use a Factor of Safety (FoS) when designing components that experience load, both stress and strain: bridges, airbags, buildings, seatbelts, toasters. So should we, as web operations.
  • Whether you express it as a “reserve”, or “overhead”, or some fraction/percentage of your ultimate limit, you should know what these are for ALL of your resources. Civil, mechanical engineers use them when designing bridges, airbags, buildings, seatbelts, toasters. So should we.
  • Because it’s a multi-core box, I know I can get up to 95% user CPU without performance degradation, but I don’t want to get that far. We’ll plan to have our webservers reach 85% so we have wiggle room for sporadic spikes above that. Why 85%? Because history has told me I could see spikes of about 15%.
  • Forecasting capacity means making educated guesses about the future by using data from the past. Throw in any knowledge you have about: - timing of feature launches - seasonal differences - new hardware deployment
  • We’ll just go through a simple example of using extrapolation and curve-fitting to make a prediction on how data (capacity metrics) will change in the future. THERE IS NO SUCH THING AS PREDICTING THE FUTURE.
  • RRDtool data, put into Excel. Need a bigger sample than 1 week, not enough peaks....let’s try 6 weeks...
  • ...try 6 weeks of data...
  • “Add a Trendline” is a feature in Excel. Note the # of weeks at the bottom. The R-squared number is the “coefficient of determination”, which indicates how good a “fit” the equation is to the data.
  • This is a linear equation given for the curve-fitting function.
  • Using Excel is time-consuming; you should automate this so you can keep tabs on it more easily. fityk has a command-line version, cfityk.
  • Same drill with Excel.
  • The same! Yay!
  • High and low-water marks
  • ADD THE ESTIMATED TIME LEFT ON HERE...
  • Yay! Savings all around!
  • Terabytes will be consumed today, not including video.
  • Watch out for 2nd-order effects when deploying new/faster machines. When throttles are opened, the dam can get moved down the river. Artur mentioned that faster pages = more traffic...we see the same thing.
  • Some well-known tips and tricks for when the shit hits the fan.
  • Before capistrano, before puppet, there was dsh. Quick and dirty.
  • Running a command on any arbitrary number of hosts, interactively. Not revolutionary, but useful.
  • Better to be mostly up than down for features that aren’t used much.
  • Hosting is like $7.95 a month for a blog. Spend the cash.
  • We have put squid in front of our search cluster and cached aggressively when we ran close to capacity.
  • Transcript

    • 1. Capacity Management for Web Operations. John Allspaw, Operations Engineering
    • 2. the book I’m writing
    • 3. ???
    • 4. Rules of Thumb; Planning/Forecasting; Stupid Capacity Tricks (with some Flickr statistics sprinkled in)
    • 5. Things that can cause downtime: bugs (disguised as capacity problems); edge cases (disguised as capacity problems); security incidents; real capacity problems* (*should be the last thing you need to worry about)
    • 6. Capacity != Performance: forget about performance for right now; measure what you have right NOW; don’t count on it getting any better
    • 7. Thank You HPC Industry! Automated Stuff; Scalable Metric Collection/Display. A lot of great deployment and management tricks come from them, adopted by web ops
    • 8. Good Measurement Tools: record and store; metrics in/out; custom metrics; easily compare; lightweight-ish
    • 9. Clouds need planning too: makes deployment and procurement easy and quick; but clouds are still resources with costs and limits, just like your own stuff; black-boxes: you may need to pay even more attention than before
    • 10. Metrics: System Statistics
    • 11. Metrics: “Application” Level (photos processed per minute) (average processing time per photo) (apache requests) (concurrent busy apache procs)
    • 12. Metrics: App-level meets system-level. Here, total CPU = ~1.12 * # busy apache procs (ymmv)
    • 13. 2400 photos per minute being uploaded right NOW (Tuesday afternoon)
    • 14. Ceilings the most amount of “work” your resources will allow before degradation or failure
    • 15. Forget Benchmarking
    • 16. Find your ceilings what you have left The End
    • 17. Use real live production data to find ceilings Production: “it’s like a lab, but bigger!”
    • 18. Like: database ceilings replication lag: bad!
    • 19. Ceilings: waiting on disk too much. Sustained disk I/O wait of >40% creates slave lag* (*for us, YMMV)
    • 20. 35,000 photo requests per second on a Tuesday peak
    • 21. Safety Factors
    • 22. Safety Factors Ceiling * Factor of Safety = UR LIMITZ
    • 23. Safety Factors webserver!
    • 24. Safety Factors: “safe” ceiling @85% CPU. 85% total CPU = ~76 busy apache procs (chart label: what you have left)
    • 25. Safety Factors Yahoo Front Page link to Chinese NewYear Photos (photo requests/second) (8% spike)
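The arithmetic behind slides 12 and 22-24 can be sketched in a few lines. This is only an illustration: the 1.12 ratio, the 95% degradation point and the 85% planning ceiling come from the slides and notes, while the function and constant names are made up here.

```python
# Slide 12's observed relationship: total CPU % ~= 1.12 * busy apache procs (ymmv).
CPU_PER_BUSY_PROC = 1.12
# The speaker notes put real degradation near 95% CPU and plan for 85%.
HARD_CEILING_CPU = 95.0
SAFE_CEILING_CPU = 85.0


def procs_at_cpu(cpu_percent, cpu_per_proc=CPU_PER_BUSY_PROC):
    """Busy apache procs one box can carry at a given total CPU percentage."""
    return cpu_percent / cpu_per_proc


# Safe per-box limit expressed in application terms (busy apache procs).
safe_procs = procs_at_cpu(SAFE_CEILING_CPU)
# Factor of safety as a fraction of the hard ceiling.
factor_of_safety = SAFE_CEILING_CPU / HARD_CEILING_CPU

print(round(safe_procs))           # 76, matching slide 24
print(round(factor_of_safety, 2))  # 0.89
```

Expressing the limit in busy procs rather than CPU % is what lets the later forecasting slides work in a single application-level unit.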
    • 26. Forecasting
    • 27. Forecasting Fictional Example: webservers
    • 28. Forecasting Fictional example: 15 webservers. 1 week. peak of the week
    • 29. ...bigger sample, 6 weeks....isolate the peaks... Forecasting
    • 30. Forecasting: ...“Add a Trendline” with some decent correlation... (chart labels: now; not too shabby)
    • 31. Forecasting: 15 servers @76 busy apache proc limit = 1140 total procs. When is this? The trendline will tell you when it reaches the ceiling (chart label: what you have left)
    • 32. Forecasting (week #10, duh) (1140-726) / 42.751 = 9.68
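Slide 32's calculation is just a straight line solved for the ceiling. The sketch below assumes, as the slide's arithmetic suggests, that 42.751 is the trendline's slope (procs per week) and 726 its intercept. The `linear_fit` helper and its sample peaks are hypothetical stand-ins for Excel's "Add a Trendline".

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def week_at_ceiling(ceiling, intercept, slope):
    """x value (week number) where the fitted line reaches the ceiling."""
    if slope <= 0:
        return float("inf")  # flat or shrinking usage never hits the ceiling
    return (ceiling - intercept) / slope


# Numbers from slides 31 and 32.
ceiling = 15 * 76  # 15 webservers at 76 busy procs each
print(ceiling)                                          # 1140
print(round(week_at_ceiling(ceiling, 726, 42.751), 2))  # 9.68

# Hypothetical weekly peaks, only to show the fitting step.
demo_slope, demo_intercept = linear_fit([1, 2, 3, 4, 5, 6],
                                        [540, 580, 620, 660, 700, 740])
```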
    • 33. Forecasting Automation: Writing excel macros is boring. All we want is “days remaining”, so all we need is the curve-fit. Use http://fityk.sf.net to automate the curve-fit
    • 34. Forecasting Fictional Example: storage consumption
    • 35. Forecasting Automation actual flickr storage consumption from early 2005, in GB (ceiling is fictional) this will tell you when this is
    • 36. Forecasting Automation: cmd line script output
        [jallspaw:~]$ cfityk ./fit-storage.fit
        1> # Fityk script. Fityk version: 0.8.2
        2> @0 < '/home/jallspaw/storage-consumption.xy'
        15 points. No explicit std. dev. Set as sqrt(y)
        3> guess Quadratic
        New function %_1 was created.
        4> fit
        Initial values: lambda=0.001 WSSR=464.564
        #1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
        #2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
        #3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
        #4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
        Fit converged.
        Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
        5> info formula in @0
        # storage-consumption
        14147.4+146.657*x+0.786854*x^2
        6> quit
        bye...
    • 37. Forecasting Automation (SAME). fityk gave: y = 0.786854x^2 + 146.657x + 14147.4 (R^2 = 99.84). Excel gave: y = 0.7675x^2 + 146.96x + 14147.3 (R^2 = 99.84)
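Once the curve-fit is automated, "when do we hit the ceiling" is a matter of solving the fitted quadratic. A minimal sketch follows, using the fityk coefficients from slide 37. The 20,000 GB ceiling is a made-up value (the deck says its ceiling is fictional and does not state it), and x is in whatever unit the data file uses.

```python
import math

# Coefficients reported by fityk on slide 37: y = A*x^2 + B*x + C (GB).
A, B, C = 0.786854, 146.657, 14147.4


def predicted_gb(x):
    """Storage consumption predicted by the fitted quadratic."""
    return A * x * x + B * x + C


def x_at_ceiling(ceiling_gb):
    """Later root of A*x^2 + B*x + (C - ceiling) = 0, or None if never reached."""
    disc = B * B - 4 * A * (C - ceiling_gb)
    if disc < 0:
        return None
    return (-B + math.sqrt(disc)) / (2 * A)


# Hypothetical ceiling, chosen only for illustration.
HYPOTHETICAL_CEILING_GB = 20000.0
x_hit = x_at_ceiling(HYPOTHETICAL_CEILING_GB)
print(f"ceiling reached near x = {x_hit:.1f}")
```

Subtracting the current x from `x_hit` gives the "days remaining" style number that slide 33 asks for.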
    • 38. Capacity Health: 12,629 nagios checks; 1314 hosts; 6 datacenters; 4 photo “farms”; farm = 2 DCs (east/west)
    • 39. High and Low Water Marks alert if higher alert if lower Per server, squid requests per second
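The point of slide 39 is that unusually low traffic is as much an alert condition as unusually high traffic. A tiny sketch of that check follows. The thresholds are invented for illustration and the function is not taken from any monitoring tool.

```python
def check_water_marks(value, low_mark, high_mark):
    """Classify a per-server metric against low and high water marks."""
    if value > high_mark:
        return "HIGH"  # approaching the ceiling
    if value < low_mark:
        return "LOW"   # suspiciously idle: maybe pulled from rotation or broken
    return "OK"


# Hypothetical per-server squid thresholds, in requests per second.
LOW_MARK, HIGH_MARK = 200, 900

for reqs_per_sec in (50, 640, 975):
    print(reqs_per_sec, check_water_marks(reqs_per_sec, LOW_MARK, HIGH_MARK))
```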
    • 40. A good dashboard looks something like... (yes, fictional numbers)
        type      #   limit/box  ceiling units  limit (total)  current (peak)  % peak  est. days left
        www       20  80         busy procs     1600           1000            62.50%  36
        shard db  20  40         I/O wait       800            220             27.50%  120
        squid     18  950        req/sec        17,100         11,400          66.67%  48
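The derived columns on the dashboard slide are simple to compute. The sketch below reproduces "limit (total)" and "% peak" from the slide's fictional inputs. "Est. days left" is omitted because it depends on a growth rate the slide does not give, and the tuple layout is an assumption of this example.

```python
# (type, number of boxes, limit per box, units, current peak) from slide 40.
rows = [
    ("www", 20, 80, "busy procs", 1000),
    ("shard db", 20, 40, "I/O wait", 220),
    ("squid", 18, 950, "req/sec", 11400),
]


def summarize(row):
    """Return (type, total limit, current peak as % of total limit)."""
    name, boxes, limit_per_box, _units, current_peak = row
    total_limit = boxes * limit_per_box
    pct_of_limit = round(100.0 * current_peak / total_limit, 2)
    return name, total_limit, pct_of_limit


summary = [summarize(row) for row in rows]
for name, total_limit, pct in summary:
    print(f"{name:10s} {total_limit:7d} {pct:6.2f}%")
```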
    • 41. Diagonal Scaling: vertically scaling your already horizontal nodes. Image processing machines; replace Dell PE860s with HP DL140G3s
    • 42. Diagonal Scaling example: image processing 4 cores 8 cores (about the same CPU “usage” per box)
    • 43. Diagonal Scaling example: image processing throughput. ~45 images/min @ peak → ~140 images/min @ peak (same CPU usage, but ~3x more work). “Processing” means making 4 sizes from originals
    • 44. Diagonal Scaling example: image processing. Went from 23 Dell PE860s (3008.4 Watts, 23U rack, 1035 photos/min) to 8 HP DL140 G3s (1036.8 Watts, 8U rack, 1120 photos/min; 75% faster, even) !!!
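To see what the diagonal-scaling swap bought, the before/after figures from slide 44 can be reduced to a few ratios. This is a back-of-the-envelope sketch. The dictionary keys and helper names are illustrative, and only the input numbers come from the slide.

```python
# Fleet totals from slide 44.
before = {"boxes": 23, "watts": 3008.4, "rack_units": 23, "photos_per_min": 1035}
after = {"boxes": 8, "watts": 1036.8, "rack_units": 8, "photos_per_min": 1120}


def photos_per_min_per_watt(fleet):
    """Throughput delivered per watt of power drawn."""
    return fleet["photos_per_min"] / fleet["watts"]


def percent_reduction(old, new):
    """How much smaller `new` is than `old`, as a percentage of `old`."""
    return 100.0 * (old - new) / old


power_saved = percent_reduction(before["watts"], after["watts"])
rack_saved = percent_reduction(before["rack_units"], after["rack_units"])
efficiency_gain = photos_per_min_per_watt(after) / photos_per_min_per_watt(before)

print(f"power down {power_saved:.1f}%, rack space down {rack_saved:.1f}%")
print(f"photos/min per watt up {efficiency_gain:.1f}x")
```

Both power and rack space drop by roughly two thirds while throughput rises slightly, so work per watt roughly triples.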
    • 45. 3.52 terabytes will be consumed today (on a Tuesday)
    • 46. 2nd Order Effects (beware the wandering bottleneck) running hot, so add more
    • 47. 2nd Order Effects (beware the wandering bottleneck) running great now, so more traffic! now these run hot
    • 48. Stupid Capacity Tricks
    • 49. Stupid Capacity Tricks: quick and dirty management with DSH (http://freshmeat.net/projects/dsh)
        [root@netmon101 ~]# cat group.of.servers
        www100
        www118
        dbcontacts3
        admin1
        admin2
    • 50. Stupid Capacity Tricks: quick and dirty management
        [root@netmon101 ~]# dsh -N group.of.servers
        dsh> date
        executing 'date'
        www100: Mon Jun 23 14:14:53 UTC 2008
        www118: Mon Jun 23 14:14:53 UTC 2008
        dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
        admin1: Mon Jun 23 14:14:53 UTC 2008
        admin2: Mon Jun 23 14:14:53 UTC 2008
        dsh>
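For readers without dsh at hand, the same "one command, many hosts" idea can be approximated with plain ssh. The sketch below is not dsh and does not reproduce its options; it assumes key-based ssh access to every host and only builds the command lines, leaving execution to the hypothetical `run_everywhere` helper.

```python
import subprocess


def parse_group(text):
    """Host names from a group file: one per line, blanks and # comments skipped."""
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.append(line)
    return hosts


def build_ssh_commands(hosts, command):
    """One argv list per host; BatchMode makes ssh fail instead of prompting."""
    return [["ssh", "-o", "BatchMode=yes", host, command] for host in hosts]


def run_everywhere(hosts, command, timeout=10):
    """Run the command on each host in turn and yield (host, output)."""
    for host, argv in zip(hosts, build_ssh_commands(hosts, command)):
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        yield host, result.stdout.strip()


# Group file contents as shown on slide 49.
group = parse_group("www100\nwww118\ndbcontacts3\nadmin1\nadmin2\n")
commands = build_ssh_commands(group, "date")
print(commands[0])
```

Calling `run_everywhere(group, "date")` would then print each host's clock in turn, much like the dsh session on slide 50.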
    • 51. Stupid Capacity Tricks: Turn Stuff OFF. Disable heavy-ish features of the site (on/off switches). We have 195 different things to disable in case of emergency.
    • 52. Stupid Capacity Tricks: Turn Stuff OFF. uploads (photo); uploads (video); uploads by email; various API things; various mobile things; various search things; etc., etc.
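The on/off switches from slides 51 and 52 amount to a registry of named features that application code consults before doing expensive work. A minimal in-memory sketch follows. The class, method and feature names are all hypothetical, and the deck does not describe how Flickr's switches are actually implemented or stored.

```python
class FeatureSwitches:
    """Registry of named features that can be turned off in an emergency."""

    def __init__(self, names):
        # Everything starts enabled; switches exist to turn things OFF.
        self._enabled = {name: True for name in names}

    def disable(self, name):
        if name not in self._enabled:
            raise KeyError(name)
        self._enabled[name] = False

    def enable(self, name):
        if name not in self._enabled:
            raise KeyError(name)
        self._enabled[name] = True

    def is_enabled(self, name):
        return self._enabled[name]

    def disabled(self):
        """Names currently switched off, sorted for display."""
        return sorted(n for n, on in self._enabled.items() if not on)


# Hypothetical identifiers loosely based on the list on slide 52.
switches = FeatureSwitches(["photo_uploads", "video_uploads", "uploads_by_email"])
switches.disable("video_uploads")

if not switches.is_enabled("video_uploads"):
    print("video uploads are temporarily off")
print(switches.disabled())
```

Raising on unknown names is a deliberate choice here: a typo in an emergency should fail loudly rather than silently leave a feature on.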
    • 53. Stupid Capacity Tricks: Outages Happen. Host your outage/status/blog page in more than one datacenter. Tell your users WTF is going on, they’ll appreciate it.
    • 54. Stupid Capacity Tricks: Hit the Pause Button. Bake the dynamic into static. Some Y! properties have a big red button to instantly bake (and un-bake) at will
    • 55. thanks http://flickr.com/photos/bondidwhat/402089763/ http://flickr.com/photos/74876632@N00/2394833962/ http://flickr.com/photos/42311564@N00/220394633/ http://flickr.com/photos/unloveable/2422483859/ http://flickr.com/photos/absolutwade/149702085/ http://flickr.com/photos/krawiec/521836276/ http://flickr.com/photos/eschipul/1560875648/ http://flickr.com/photos/library_of_congress/2179060841/ http://flickr.com/photos/jekkyl/511187885/ http://flickr.com/photos/ab8wn/368021672/ http://flickr.com/photos/jaxxon/165559708/ http://flickr.com/photos/sparktography/75499095/
    • 56. We’re Hiring! flickr.com/jobs Come see me!
    • 57. questions?
