This document discusses strategies and techniques for managing capacity in web operations. It begins with the importance of measuring current capacity and finding ceilings rather than relying on benchmarks. It then covers forecasting future capacity needs, establishing safety factors, diagonal scaling to improve performance, and handling capacity issues through techniques such as distributed shell commands, disabling features, baking dynamic content into static pages, and scheduling outages. Stupid but effective capacity tricks are also mentioned.
4. Rules of Thumb
Planning/Forecasting
Stupid Capacity Tricks
(with some Flickr statistics sprinkled in)
5. Things that can cause downtime
bugs (disguised as capacity problems)
edge cases (disguised as capacity problems)
security incidents
real capacity problems*
* (should be the last thing you need to worry about)
6. Capacity != Performance
Forget about performance for right now
Measure what you have right NOW
Don’t count on it getting any better
7. Thank You HPC Industry!
Automated Stuff
Scalable Metric Collection/Display
A lot of great deployment and management tricks come from them, adopted by web ops
8. Good Measurement Tools
record and store metrics in/out
custom metrics
easily compare
lightweight-ish
9. Clouds need planning too
Makes deployment and procurement easy and quick
But clouds are still resources with costs and limits, just like your own stuff
Black-boxes: you may need to pay even more attention than before
33. Forecasting Automation
Writing Excel macros is boring
All we want is “days remaining”, so all we need is the curve-fit
Use http://fityk.sf.net to automate the curve-fit
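
To make the "days remaining" idea concrete, here is a minimal Python sketch. It is not the fityk workflow from the slide: it assumes a plain straight-line least-squares fit, and the function names, sample numbers, and ceiling are all hypothetical. Real usage data may need a different curve.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_remaining(days, usage, ceiling):
    """Days from the last sample until the fitted line reaches `ceiling`.

    Returns None when usage is flat or shrinking (the line never hits the ceiling).
    Assumption: growth is roughly linear over the sampled window.
    """
    slope, intercept = linear_fit(days, usage)
    if slope <= 0:
        return None
    day_at_ceiling = (ceiling - intercept) / slope
    return max(0.0, day_at_ceiling - days[-1])


if __name__ == "__main__":
    # Hypothetical samples: day number vs. disk used (TB), with a 100 TB ceiling.
    sample_days = [0, 1, 2, 3, 4]
    sample_usage = [10, 20, 30, 40, 50]
    print(days_remaining(sample_days, sample_usage, ceiling=100))
```

Run from cron against the real metric store, a script like this can replace the spreadsheet step entirely; only the fit function needs to change if the growth is not linear.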
42. Diagonal Scaling
example: image processing
4 cores
8 cores
(about the same CPU “usage” per box)
43. Diagonal Scaling
example: image processing throughput
4 cores: ~45 images/min @ peak
8 cores: ~140 images/min @ peak
(same CPU usage, but ~3x more work)
“processing” means making 4 sizes from originals
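
The arithmetic behind these two slides can be sketched as follows. This assumes the ~45 and ~140 images/min figures are per-box peak throughput for the 4-core and 8-core machines respectively (an assumption about how the two slides line up); the 1,000 images/min demand figure is hypothetical and only there to show the effect on box count.

```python
import math

# Per-box peak throughput taken from the slide (images/min).
OLD_BOX_THROUGHPUT = 45    # assumed to be the 4-core box
NEW_BOX_THROUGHPUT = 140   # assumed to be the 8-core box


def boxes_needed(peak_demand_per_min, per_box_throughput):
    """Smallest number of boxes that covers the peak demand."""
    return math.ceil(peak_demand_per_min / per_box_throughput)


if __name__ == "__main__":
    hypothetical_peak = 1000  # images/min, made up for illustration
    speedup = NEW_BOX_THROUGHPUT / OLD_BOX_THROUGHPUT
    print(f"per-box speedup: ~{speedup:.1f}x")
    print("4-core boxes needed:", boxes_needed(hypothetical_peak, OLD_BOX_THROUGHPUT))
    print("8-core boxes needed:", boxes_needed(hypothetical_peak, NEW_BOX_THROUGHPUT))
```

The point of diagonal scaling shows up in the box count: fewer, bigger machines do the same work, which also means less rack space and power to plan for.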
50. Stupid Capacity Tricks
quick and dirty management
[root@netmon101 ~]# dsh -N group.of.servers
dsh> date
executing 'date'
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
dsh>
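
For readers without dsh installed, the same fan-out idea can be approximated in a few lines of Python. This is a rough sketch and not how dsh itself works: it assumes passwordless ssh to each host, and `ssh_run` and `run_everywhere` are hypothetical names. The runner is passed in as a parameter so the fan-out logic can be tried without touching real servers.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_run(host, command, timeout=10):
    """Run `command` on `host` over ssh and return its output as text.

    Assumes key-based login; BatchMode=yes makes ssh fail instead of prompting.
    """
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout.strip() or result.stderr.strip()


def run_everywhere(hosts, command, runner=ssh_run, max_workers=16):
    """Run `command` on every host in parallel; return {host: output}."""

    def safe(host):
        # One unreachable host should not abort the whole run.
        try:
            return runner(host, command)
        except Exception as exc:
            return f"ERROR: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(safe, hosts))
    return dict(zip(hosts, outputs))


if __name__ == "__main__":
    # Fake runner so the example works without any servers.
    fake = lambda host, command: f"pretend output of '{command}'"
    for host, output in run_everywhere(["www100", "admin1"], "date", runner=fake).items():
        print(f"{host}: {output}")
```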
51. Stupid Capacity Tricks
Turn Stuff OFF
Disable heavy-ish features of the site
(on/off switches)
We have 195 different things to disable in case of emergency.
52. Stupid Capacity Tricks
Turn Stuff OFF
uploads (photo)
uploads (video)
uploads by email
various API things
various mobile things
various search things
etc., etc.
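
An on/off switch of this kind can be as small as the sketch below. This is an illustration only, not Flickr's implementation: the class, the feature names, and the handler are hypothetical, and a real site would keep the switch state somewhere shared (config file, database, cache) rather than in process memory.

```python
class FeatureSwitches:
    """In-memory on/off switches; every feature starts enabled."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}

    def turn_off(self, name):
        self._set(name, False)

    def turn_on(self, name):
        self._set(name, True)

    def is_on(self, name):
        # Unknown names raise KeyError so typos are caught early.
        return self._enabled[name]

    def _set(self, name, value):
        if name not in self._enabled:
            raise KeyError(f"unknown feature: {name}")
        self._enabled[name] = value


# Hypothetical switch names modeled on the list above.
switches = FeatureSwitches(["photo_uploads", "video_uploads", "uploads_by_email"])


def handle_photo_upload(payload):
    """Check the switch before doing any expensive work."""
    if not switches.is_on("photo_uploads"):
        return 503, "Photo uploads are temporarily disabled."
    return 200, f"stored {len(payload)} bytes"


if __name__ == "__main__":
    print(handle_photo_upload(b"abc"))
    switches.turn_off("photo_uploads")   # emergency: shed the load
    print(handle_photo_upload(b"abc"))
```

The check sits at the very top of the handler so that a disabled feature costs almost nothing to refuse.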
53. Stupid Capacity Tricks
Outages Happen
Host your outage/status/blog page in more than one datacenter.
Tell your users WTF is going on, they’ll appreciate it.
54. Stupid Capacity Tricks
Hit the Pause Button
Bake the dynamic into static
Some Y! properties have a big red button to instantly bake (and un-bake) at will
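
A bake/un-bake button could look roughly like this in Python. It is a sketch under assumptions, not a description of how any Y! property does it: `bake`, `unbake`, and `serve` are hypothetical names, and the "dynamic" page is stood in for by any function that returns HTML.

```python
from pathlib import Path


def bake(render, baked_path):
    """Render the dynamic page once and save it as a static file.

    Writes to a temp file first, then renames, so readers never see a half-written page.
    """
    tmp_path = baked_path.with_suffix(baked_path.suffix + ".tmp")
    tmp_path.write_text(render())
    tmp_path.replace(baked_path)


def unbake(baked_path):
    """Remove the static copy so requests go back to the dynamic path."""
    baked_path.unlink(missing_ok=True)


def serve(render, baked_path):
    """Serve the baked file if one exists, otherwise render dynamically."""
    if baked_path.exists():
        return baked_path.read_text()
    return render()


if __name__ == "__main__":
    import tempfile

    def expensive_homepage():
        # Stand-in for a page that normally hits databases and caches.
        return "<h1>Explore</h1>"

    with tempfile.TemporaryDirectory() as tmp_dir:
        page = Path(tmp_dir) / "index.html"
        bake(expensive_homepage, page)            # hit the pause button
        print(serve(expensive_homepage, page))    # served from the static file
        unbake(page)                              # resume dynamic rendering
```

While a page is baked, traffic to it costs one file read instead of a full render, which is what buys time during a capacity emergency.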