Capacity Management for Web Operations
Capacity Management for Web Operations Presentation Transcript

  • 1. Capacity Management for Web Operations John Allspaw Operations Engineering
  • 2. the book I’m writing
  • 3. ???
  • 4. Rules of Thumb Planning/Forecasting Stupid Capacity Tricks (with some Flickr statistics sprinkled in)
  • 5. Things that can cause downtime bugs (disguised as capacity problems) edge cases (disguised as capacity problems) security incidents real capacity problems* * (should be the last thing you need to worry about)
  • 6. Capacity != Performance Forget about performance for right now Measure what you have right NOW Don’t count on it getting any better
  • 7. Thank You HPC Industry! Automated Stuff Scalable Metric Collection/Display a lot of great deployment and management tricks come from them, adopted by web ops
  • 8. Good Measurement Tools record and store metrics in/out custom metrics easily compare lightweight-ish
  • 9. Clouds need planning too Makes deployment and procurement easy and quick But clouds are still resources with costs and limits, just like your own stuff Black-boxes: you may need to pay even more attention than before
  • 10. Metrics System Statistics
  • 11. Metrics “Application” Level (photos processed per minute) (average processing time per photo) (apache requests) (concurrent busy apache procs)
  • 12. Metrics App-level meets system-level here, total CPU = ~1.12 * # busy apache procs (ymmv)
  • 13. 2400 photos per minute being uploaded right NOW (Tuesday afternoon)
  • 14. Ceilings the most amount of “work” your resources will allow before degradation or failure
  • 15. Forget Benchmarking
  • 16. Find your ceilings (chart labels: “what you have left”, “The End”)
  • 17. Use real live production data to find ceilings Production: “it’s like a lab, but bigger!”
  • 18. Like: database ceilings replication lag: bad!
  • 19. Ceilings waiting on disk: sustained disk I/O wait that is too high (>40%) creates slave lag* *for us, YMMV
  • 20. 35,000 photo requests per second on a Tuesday peak
  • 21. Safety Factors
  • 22. Safety Factors Ceiling * Factor of Safety = UR LIMITZ
  • 23. Safety Factors webserver!
  • 24. Safety Factors what you have left “safe” ceiling @85% CPU 85% total CPU = ~76 busy apache procs
  • 25. Safety Factors Yahoo Front Page link to Chinese New Year Photos (8% spike) (photo requests/second)
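The safety-factor arithmetic from slides 12 and 24 can be sketched as follows. This is a minimal illustration, not the speaker's tooling: the 1.12 CPU-per-proc ratio and the 85% "safe" CPU ceiling are the talk's own numbers (and explicitly YMMV), while the function and constant names are assumptions made up for this example.

```python
# Sketch of the safety-factor arithmetic from slides 12 and 24.
# The ratio and the 85% ceiling come from the slides; all names are illustrative.

CPU_PERCENT_PER_BUSY_PROC = 1.12   # total CPU % ~= 1.12 * busy apache procs
SAFE_CPU_PERCENT = 85.0            # "safe" ceiling chosen on slide 24


def safe_busy_procs(safe_cpu_percent: float, cpu_per_proc: float) -> int:
    """Busy apache procs one server can carry at the safe CPU ceiling."""
    return round(safe_cpu_percent / cpu_per_proc)


def headroom(current_busy_procs: float, limit_busy_procs: float) -> float:
    """Fraction of the safe limit still unused ("what you have left")."""
    return 1.0 - current_busy_procs / limit_busy_procs


limit = safe_busy_procs(SAFE_CPU_PERCENT, CPU_PERCENT_PER_BUSY_PROC)
print(limit)  # 85 / 1.12 is about 75.9, which rounds to the slide's 76
```

Expressing the limit in an application-level unit (busy procs) rather than raw CPU is what lets the later forecasting slides multiply it by a server count.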
  • 26. Forecasting
  • 27. Forecasting Fictional Example: webservers
  • 28. Forecasting peak of the week Fictional example: 15 webservers. 1 week.
  • 29. Forecasting ...bigger sample, 6 weeks....isolate the peaks...
  • 30. Forecasting not too shabby now ...“Add a Trendline” with some decent correlation...
  • 31. Forecasting (chart labels: “this will tell you when it is”, “ceiling”, “when is this?”, “what you have left”) 15 servers @76 busy apache proc limit = 1140 total procs
  • 32. Forecasting (1140-726) / 42.751 = 9.68 (week #10, duh)
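The calculation on slide 32 is a straight-line extrapolation: remaining headroom divided by weekly growth. A minimal sketch, assuming 726 is the current level the slide subtracts from the ceiling and 42.751 is the trendline slope in busy procs per week; the function name is hypothetical.

```python
import math


def weeks_until_ceiling(ceiling: float, current: float, growth_per_week: float) -> float:
    """Weeks until a linear trend reaches the ceiling (infinite if not growing)."""
    if growth_per_week <= 0:
        return math.inf
    return max(0.0, (ceiling - current) / growth_per_week)


servers = 15
busy_proc_limit_per_server = 76                 # safe ceiling from slide 24
ceiling = servers * busy_proc_limit_per_server  # 1140 total procs

weeks = weeks_until_ceiling(ceiling, current=726, growth_per_week=42.751)
print(round(weeks, 2))   # 9.68
print(math.ceil(weeks))  # 10, i.e. the ceiling is reached during week 10
```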
  • 33. Forecasting Automation Writing excel macros is boring All we want is “days remaining”, so all we need is the curve-fit Use http://fityk.sf.net to automate the curve-fit
  • 34. Forecasting Fictional Example: storage consumption
  • 35. Forecasting Automation (chart label: “this will tell you when this is”) actual flickr storage consumption from early 2005, in GB (ceiling is fictional)
  • 36. Forecasting Automation (cmd line script and its output)
    [jallspaw:~]$ cfityk ./fit-storage.fit
    1> # Fityk script. Fityk version: 0.8.2
    2> @0 < '/home/jallspaw/storage-consumption.xy'
    15 points. No explicit std. dev. Set as sqrt(y)
    3> guess Quadratic
    New function %_1 was created.
    4> fit
    Initial values: lambda=0.001 WSSR=464.564
    #1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
    #2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
    #3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
    #4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
    Fit converged.
    Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
    5> info formula in @0
    # storage-consumption
    14147.4+146.657*x+0.786854*x^2
    6> quit
    bye...
  • 37. Forecasting Automation fityk gave: y = 0.786854x^2 + 146.657x + 14147.4 (R^2 = 99.84) Excel gave: y = 0.7675x^2 + 146.96x + 14147.3 (R^2 = 99.84) (SAME)
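Once the curve-fit is available, "days remaining" comes from solving the fitted quadratic for the point where it crosses the ceiling. The sketch below uses the fityk coefficients from the slides; the 20,000 GB ceiling, the `x_now` value, and the function name are assumptions for illustration only (the talk's ceiling is fictional and not stated), and x is assumed to be in the same time unit as the input data.

```python
import math

# Coefficients reported by fityk: y = A*x^2 + B*x + C (y in GB)
A, B, C = 0.786854, 146.657, 14147.4


def x_at_ceiling(a: float, b: float, c: float, ceiling: float) -> float:
    """Smallest x >= 0 where a*x^2 + b*x + c reaches the ceiling.

    Assumes a > 0 and b >= 0, i.e. consumption that keeps growing.
    """
    if ceiling <= c:
        return 0.0  # already at or past the ceiling at x = 0
    discriminant = b * b - 4.0 * a * (c - ceiling)
    return (-b + math.sqrt(discriminant)) / (2.0 * a)


HYPOTHETICAL_CEILING_GB = 20000.0  # assumption, not from the talk
x_now = 15.0                       # assumption: position of the latest data point

x_cross = x_at_ceiling(A, B, C, HYPOTHETICAL_CEILING_GB)
print(f"ceiling reached at x = {x_cross:.1f}, remaining = {x_cross - x_now:.1f}")
```

Only the positive root is taken because the fitted curve is increasing for x >= 0 with these coefficients.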
  • 38. Capacity Health 12,629 nagios checks 1314 hosts 6 datacenters 4 photo “farms” farm = 2 DCs (east/west)
  • 39. High and Low Water Marks alert if higher alert if lower Per server, squid requests per second
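A high/low water mark check is small enough to sketch directly. The thresholds, host names, and sample values below are invented for illustration; only the idea (alert when a per-server metric leaves a band in either direction) comes from the slide.

```python
def check_water_marks(value: float, low: float, high: float) -> str:
    """Classify a metric sample against low and high water marks."""
    if value > high:
        return "HIGH"
    if value < low:
        return "LOW"
    return "OK"


# Hypothetical per-server squid requests per second.
LOW_WATER, HIGH_WATER = 50, 950
samples = {"squid1": 420, "squid2": 12, "squid3": 1100}

alerts = {host: check_water_marks(rps, LOW_WATER, HIGH_WATER)
          for host, rps in samples.items()}
print(alerts)  # {'squid1': 'OK', 'squid2': 'LOW', 'squid3': 'HIGH'}
```

The low mark is worth having because a server that suddenly receives far less traffic than expected can be as much of a problem as one that is overloaded.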
  • 40. A good dashboard looks something like... (yes, fictional numbers)
    type      # boxes  limit/box  units       limit (total)  current (peak)  % peak  est. days left
    www       20       80         busy procs  1600           1000            62.50%  36
    shard db  20       40         I/O wait    800            220             27.50%  120
    squid     18       950        req/sec     17,100         11,400          66.67%  48
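The derived columns of such a dashboard (total limit and percent of peak) follow directly from box count, per-box limit, and current peak. A minimal sketch using the slide's fictional numbers; the `Resource` class and its field names are hypothetical, and the "days left" column is omitted because it needs a growth forecast that the slide does not give.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    boxes: int
    limit_per_box: float
    units: str
    current_peak: float

    @property
    def total_limit(self) -> float:
        return self.boxes * self.limit_per_box

    @property
    def percent_of_limit(self) -> float:
        return 100.0 * self.current_peak / self.total_limit


rows = [
    Resource("www", 20, 80, "busy procs", 1000),
    Resource("shard db", 20, 40, "I/O wait", 220),
    Resource("squid", 18, 950, "req/sec", 11400),
]

for r in rows:
    print(f"{r.name:<9} {r.total_limit:>8,.0f} {r.current_peak:>8,.0f} "
          f"{r.percent_of_limit:6.2f}%  ({r.units})")
```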
  • 41. Diagonal Scaling vertically scaling your already horizontal nodes Image processing machines Replace Dell PE860s with HP DL140G3s
  • 42. Diagonal Scaling example: image processing 4 cores 8 cores (about the same CPU “usage” per box)
  • 43. Diagonal Scaling example: image processing throughput ~45 images/min @ peak ~140 images/min @ peak (same CPU usage, but ~3x more work) “processing” means making 4 sizes from originals
  • 44. Diagonal Scaling example: image processing went from: 23 Dell PE860s, 3008.4 Watts, 1035 photos/min, 23U rack to: 8 HP DL140 G3s, 1036.8 Watts, 1120 photos/min, 8U rack !!! (75% faster, even)
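The before/after comparison is easier to judge as ratios. The sketch below reads the slide's figures as 23 Dell PE860s (3008.4 W, 1035 photos/min, 23U) versus 8 HP DL140 G3s (1036.8 W, 1120 photos/min, 8U); the `Fleet` class and the derived metrics are illustrative assumptions, not something shown in the talk.

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    name: str
    boxes: int
    watts: float
    photos_per_min: float
    rack_units: int

    def watts_per_photo_per_min(self) -> float:
        """Power cost of one photo/min of processing throughput."""
        return self.watts / self.photos_per_min


before = Fleet("Dell PE860", 23, 3008.4, 1035, 23)
after = Fleet("HP DL140 G3", 8, 1036.8, 1120, 8)

power_saved = 1 - after.watts / before.watts
rack_saved = 1 - after.rack_units / before.rack_units

for fleet in (before, after):
    print(f"{fleet.name}: {fleet.watts_per_photo_per_min():.2f} W per photo/min")
print(f"power saved: {power_saved:.1%}, rack space saved: {rack_saved:.1%}")
```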
  • 45. 3.52 terabytes will be consumed today (on a Tuesday)
  • 46. 2nd Order Effects (beware the wandering bottleneck) LB running hot, so add more www (diagram: www, www, db, search, memcached)
  • 47. 2nd Order Effects (beware the wandering bottleneck) LB running great now, so more traffic! (diagram: www, www, www, www) now these run hot: db, search, memcached
  • 48. Stupid Capacity Tricks
  • 49. Stupid Capacity Tricks quick and dirty management DSH http://freshmeat.net/projects/dsh
    [root@netmon101 ~]# cat group.of.servers
    www100
    www118
    dbcontacts3
    admin1
    admin2
  • 50. Stupid Capacity Tricks quick and dirty management
    [root@netmon101 ~]# dsh -N group.of.servers
    dsh> date
    executing 'date'
    www100: Mon Jun 23 14:14:53 UTC 2008
    www118: Mon Jun 23 14:14:53 UTC 2008
    dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
    admin1: Mon Jun 23 14:14:53 UTC 2008
    admin2: Mon Jun 23 14:14:53 UTC 2008
    dsh>
  • 51. Stupid Capacity Tricks Turn Stuff OFF Disable heavy-ish features of the site (on/off switches) We have 195 different things to disable in case of emergency.
  • 52. Stupid Capacity Tricks Turn Stuff OFF uploads (photo) uploads (video) uploads by email various API things various mobile things various search things etc., etc.
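The on/off switches described on slides 51 and 52 can be as simple as a lookup consulted before each heavy-ish code path. This is a sketch under stated assumptions, not how Flickr implements it: the class, the switch names, and the handler are hypothetical, and a real site would likely store the switches in shared configuration rather than in process memory.

```python
class FeatureSwitches:
    """In-memory on/off switches, all enabled by default."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}

    def set(self, name: str, enabled: bool) -> None:
        if name not in self._enabled:
            raise KeyError(f"unknown switch: {name}")
        self._enabled[name] = enabled

    def is_enabled(self, name: str) -> bool:
        return self._enabled[name]


# Hypothetical switch names modelled on the list in slide 52.
switches = FeatureSwitches(["photo_uploads", "video_uploads", "uploads_by_email"])


def handle_video_upload(switches: FeatureSwitches) -> str:
    """Refuse politely when the feature is switched off."""
    if not switches.is_enabled("video_uploads"):
        return "Video uploads are temporarily disabled."
    return "upload accepted"


switches.set("video_uploads", False)   # emergency: shed the heavy feature
print(handle_video_upload(switches))   # Video uploads are temporarily disabled.
```

Rejecting unknown switch names with an error guards against typos silently leaving a feature on during an emergency.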
  • 53. Stupid Capacity Tricks Outages Happen Host your outage/status/blog page in more than one datacenter. Tell your users WTF is going on, they’ll appreciate it.
  • 54. Stupid Capacity Tricks Hit the Pause Button Bake the dynamic into static Some Y! properties have a big red button to instantly bake (and un-bake) at will
  • 55. thanks http://flickr.com/photos/bondidwhat/402089763/ http://flickr.com/photos/74876632@N00/2394833962/ http://flickr.com/photos/42311564@N00/220394633/ http://flickr.com/photos/unloveable/2422483859/ http://flickr.com/photos/absolutwade/149702085/ http://flickr.com/photos/krawiec/521836276/ http://flickr.com/photos/eschipul/1560875648/ http://flickr.com/photos/library_of_congress/2179060841/ http://flickr.com/photos/jekkyl/511187885/ http://flickr.com/photos/ab8wn/368021672/ http://flickr.com/photos/jaxxon/165559708/ http://flickr.com/photos/sparktography/75499095/
  • 56. We’re Hiring! flickr.com/jobs Come see me!
  • 57. questions?