Capacity Management for Web Operations

John Allspaw
Operations Engineering

  1. Capacity Management for Web Operations. John Allspaw, Operations Engineering
  2. the book I’m writing
  3. ???
  4. Rules of Thumb; Planning/Forecasting; Stupid Capacity Tricks (with some Flickr statistics sprinkled in)
  5. Things that can cause downtime: bugs (disguised as capacity problems); edge cases (disguised as capacity problems); security incidents; real capacity problems* (*should be the last thing you need to worry about)
  6. Capacity != Performance. Forget about performance for right now. Measure what you have right NOW. Don’t count on it getting any better.
  7. Thank You HPC Industry! Automated Stuff; Scalable Metric Collection/Display. A lot of great deployment and management tricks come from them, adopted by web ops.
  8. Good Measurement Tools: record and store metrics in/out; custom metrics; easily compare; lightweight-ish
  9. Clouds need planning too. Makes deployment and procurement easy and quick. But clouds are still resources with costs and limits, just like your own stuff. Black-boxes: you may need to pay even more attention than before.
  10. Metrics: System Statistics
  11. Metrics: “Application” Level (photos processed per minute) (average processing time per photo) (apache requests) (concurrent busy apache procs)
  12. Metrics: App-level meets system-level. Here, total CPU = ~1.12 * # busy apache procs (ymmv)
  13. 2400 photos per minute being uploaded right NOW (Tuesday afternoon)
  14. Ceilings: the greatest amount of “work” your resources will allow before degradation or failure
  15. Forget Benchmarking
  16. Find your ceilings (chart labels: “what you have left”; “The End”)
  17. Use real live production data to find ceilings. Production: “it’s like a lab, but bigger!”
  18. Like: database ceilings. Replication lag: bad!
  19. Ceilings: waiting on disk too much. Sustained disk I/O wait for >40% creates slave lag* (*for us, YMMV)
  20. 35,000 photo requests per second on a Tuesday peak
  21. Safety Factors
  22. Safety Factors: Ceiling * Factor of Safety = UR LIMITZ
  23. Safety Factors: webserver!
  24. Safety Factors (chart labels: “what you have left”; “safe” ceiling @85% CPU). 85% total CPU = ~76 busy apache procs
  25. Safety Factors: Yahoo Front Page link to Chinese New Year Photos (8% spike) (photo requests/second)
  26. Forecasting
  27. Forecasting. Fictional example: webservers
  28. Forecasting: peak of the week. Fictional example: 15 webservers, 1 week.
  29. Forecasting: ...bigger sample, 6 weeks... isolate the peaks...
  30. Forecasting (chart labels: “not too shabby”; “now”). ...“Add a Trendline” with some decent correlation...
  31. Forecasting (chart labels: “ceiling”; “what you have left”; “when is this?”; “this will tell you when it is”). 15 servers @ 76 busy apache proc limit = 1140 total procs
  32. Forecasting: (1140-726) / 42.751 = 9.68 (week #10, duh)
  33. Forecasting Automation: Writing Excel macros is boring. All we want is “days remaining”, so all we need is the curve-fit. Use http://fityk.sf.net to automate the curve-fit.
  34. Forecasting. Fictional example: storage consumption
  35. Forecasting Automation (chart label: “this will tell you when this is”). Actual Flickr storage consumption from early 2005, in GB (ceiling is fictional)
  36. Forecasting Automation: command-line script and its output

         [jallspaw:~]$ cfityk ./fit-storage.fit
         1> # Fityk script. Fityk version: 0.8.2
         2> @0 < '/home/jallspaw/storage-consumption.xy'
         15 points. No explicit std. dev. Set as sqrt(y)
         3> guess Quadratic
         New function %_1 was created.
         4> fit
         Initial values: lambda=0.001 WSSR=464.564
         #1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%)
         #2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%)
         #3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%)
         #4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%)
         Fit converged.
         Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
         5> info formula in @0
         # storage-consumption
         14147.4+146.657*x+0.786854*x^2
         6> quit
         bye...

  37. Forecasting Automation. fityk gave: y = 0.786854x^2 + 146.657x + 14147.4 (R^2 = 99.84). Excel gave: y = 0.7675x^2 + 146.96x + 14147.3 (R^2 = 99.84). (SAME)
  38. Capacity Health: 12,629 nagios checks; 1314 hosts; 6 datacenters; 4 photo “farms” (farm = 2 DCs, east/west)
  39. High and Low Water Marks: alert if higher; alert if lower. Per server, squid requests per second
  40. A good dashboard looks something like... (yes, fictional numbers)

         type      #    limit/box   ceiling units   est. limit (total)   current (peak)   % peak    days left
         www       20   80          busy procs      1600                 1000             62.50%    36
         shard db  20   40          I/O wait        800                  220              27.50%    120
         squid     18   950         req/sec         17,100               11,400           66.67%    48

  41. Diagonal Scaling: vertically scaling your already horizontal nodes. Image processing machines: replace Dell PE860s with HP DL140G3s
  42. Diagonal Scaling, example: image processing. 4 cores vs. 8 cores (about the same CPU “usage” per box)
  43. Diagonal Scaling, example: image processing throughput. ~45 images/min @ peak vs. ~140 images/min @ peak (same CPU usage, but ~3x more work). “Processing” means making 4 sizes from originals.
  44. Diagonal Scaling, example: image processing

         went from:  23 Dell PE860s    3008.4 Watts   1035 photos/min   23U rack
         to:         8 HP DL140 G3s    1036.8 Watts   1120 photos/min   8U rack

         !!! (75% faster, even)

  45. 3.52 terabytes will be consumed today (on a Tuesday)
  46. 2nd Order Effects (beware the wandering bottleneck). Diagram: LB; www, www (running hot, so add more); db, search, memcached
  47. 2nd Order Effects (beware the wandering bottleneck). Diagram: LB; www, www, www, www (running great now, so more traffic!); db, search, memcached (now these run hot)
  48. Stupid Capacity Tricks
  49. Stupid Capacity Tricks: quick and dirty management. DSH: http://freshmeat.net/projects/dsh

         [root@netmon101 ~]# cat group.of.servers
         www100
         www118
         dbcontacts3
         admin1
         admin2

  50. Stupid Capacity Tricks: quick and dirty management

         [root@netmon101 ~]# dsh -N group.of.servers
         dsh> date
         executing 'date'
         www100: Mon Jun 23 14:14:53 UTC 2008
         www118: Mon Jun 23 14:14:53 UTC 2008
         dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
         admin1: Mon Jun 23 14:14:53 UTC 2008
         admin2: Mon Jun 23 14:14:53 UTC 2008
         dsh>

  51. Stupid Capacity Tricks: Turn Stuff OFF. Disable heavy-ish features of the site (on/off switches). We have 195 different things to disable in case of emergency.
  52. Stupid Capacity Tricks: Turn Stuff OFF. Uploads (photo); uploads (video); uploads by email; various API things; various mobile things; various search things; etc., etc.
  53. Stupid Capacity Tricks: Outages Happen. Host your outage/status/blog page in more than one datacenter. Tell your users WTF is going on, they’ll appreciate it.
  54. Stupid Capacity Tricks: Hit the Pause Button. Bake the dynamic into static. Some Y! properties have a big red button to instantly bake (and un-bake) at will.
  55. thanks http://flickr.com/photos/bondidwhat/402089763/ http://flickr.com/photos/74876632@N00/2394833962/ http://flickr.com/photos/42311564@N00/220394633/ http://flickr.com/photos/unloveable/2422483859/ http://flickr.com/photos/absolutwade/149702085/ http://flickr.com/photos/krawiec/521836276/ http://flickr.com/photos/eschipul/1560875648/ http://flickr.com/photos/library_of_congress/2179060841/ http://flickr.com/photos/jekkyl/511187885/ http://flickr.com/photos/ab8wn/368021672/ http://flickr.com/photos/jaxxon/165559708/ http://flickr.com/photos/sparktography/75499095/
  56. We’re Hiring! flickr.com/jobs Come see me!
  57. questions?
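
The safety-factor arithmetic on slides 12, 22, 24 and 31 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and treating the hard ceiling as 100% CPU with a 0.85 factor of safety is an assumed reading of the "safe ceiling @85% CPU" label. The ~1.12 ratio is the one quoted on slide 12 (ymmv).

```python
CPU_PER_BUSY_PROC = 1.12   # slide 12: total CPU % ~= 1.12 * busy apache procs
SAFE_CPU_PERCENT = 85.0    # slide 24: "safe" ceiling at 85% CPU


def safe_limit(ceiling, factor_of_safety):
    """Slide 22: ceiling * factor of safety = the limit you plan against."""
    return ceiling * factor_of_safety


def safe_procs_per_server(safe_cpu=SAFE_CPU_PERCENT, cpu_per_proc=CPU_PER_BUSY_PROC):
    """Busy apache procs one server can carry at the safe CPU ceiling."""
    return round(safe_cpu / cpu_per_proc)


def cluster_limit(servers, per_server_limit):
    """Total limit for a pool of identical servers."""
    return servers * per_server_limit


def percent_of_limit(current_peak, total_limit):
    """How much of the planned limit the current peak already uses."""
    return 100.0 * current_peak / total_limit


if __name__ == "__main__":
    per_server = safe_procs_per_server()              # about 76 busy procs
    print(per_server, cluster_limit(15, per_server))  # slide 31: 15 servers
    print(percent_of_limit(1000, 1600))               # the www row on slide 40
```

The same `percent_of_limit` calculation reproduces the "% peak" column of the fictional dashboard on slide 40.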
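
The "Add a Trendline" step on slides 30 to 32 amounts to a least-squares line through the weekly peaks, followed by solving for the week in which that line reaches the ceiling. A minimal sketch follows. It assumes that 726 is the trendline's value at week 0 and that 42.751 is its slope in busy procs per week, which is one reading of the slide's arithmetic. The function names are hypothetical.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def weeks_until_ceiling(ceiling, intercept, slope):
    """Week at which the trendline reaches the ceiling, or None if it never grows."""
    if slope <= 0:
        return None
    return (ceiling - intercept) / slope


if __name__ == "__main__":
    # Slide 32: (1140 - 726) / 42.751
    weeks = weeks_until_ceiling(1140, 726, 42.751)
    print(round(weeks, 2), "-> week #%d" % math.ceil(weeks))
```

In practice the slope and intercept would come from `linear_fit` applied to the isolated weekly peaks rather than being typed in by hand.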
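
Once a tool such as fityk has produced the quadratic on slides 36 and 37, the remaining step is to solve it for the x at which consumption reaches the ceiling. The sketch below uses the fityk coefficients from the slide. The 20,000 GB ceiling is a made-up value for illustration (the slide's own ceiling is fictional and not stated), the function name is hypothetical, and x is in whatever unit the data file's first column uses.

```python
import math

# Coefficients reported by fityk on slide 37: y = A*x^2 + B*x + C (y in GB)
A, B, C = 0.786854, 146.657, 14147.4


def x_at_ceiling(a, b, c, ceiling):
    """Smallest x >= 0 where a*x^2 + b*x + c reaches ceiling, or None if never."""
    if c >= ceiling:
        return 0.0  # already at or past the ceiling
    if a == 0:
        return (ceiling - c) / b if b > 0 else None
    disc = b * b - 4 * a * (c - ceiling)
    if disc < 0:
        return None  # the curve never reaches the ceiling
    root = math.sqrt(disc)
    candidates = sorted(
        x for x in ((-b - root) / (2 * a), (-b + root) / (2 * a)) if x >= 0
    )
    return candidates[0] if candidates else None


if __name__ == "__main__":
    hypothetical_ceiling_gb = 20000.0  # assumed value, not from the slides
    print(x_at_ceiling(A, B, C, hypothetical_ceiling_gb))
```

Subtracting the current x from the result gives the "days remaining" style number that slide 33 asks for.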
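
The high and low water marks on slide 39 can be expressed as a simple per-host check. The sketch below is a hypothetical illustration, not the Nagios checks Flickr used. The host names and thresholds are invented, and the function name is an assumption.

```python
def water_mark_alerts(readings, low, high):
    """Return (host, "high" | "low", value) for every host outside [low, high].

    readings maps a host name to its current per-server value,
    for example squid requests per second.
    """
    alerts = []
    for host, value in sorted(readings.items()):
        if value > high:
            alerts.append((host, "high", value))
        elif value < low:
            alerts.append((host, "low", value))
    return alerts


if __name__ == "__main__":
    # Invented sample values for three squid servers.
    sample = {"squid1": 400, "squid2": 990, "squid3": 12}
    for host, kind, value in water_mark_alerts(sample, low=50, high=950):
        print(f"{host}: {kind} water mark crossed ({value} req/sec)")
```

A low reading is worth alerting on as well as a high one. Presumably a server handling far less traffic than expected is broken or out of rotation, although the slide does not spell out the reason.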