From Zero
To Capacity Planning
@Randommood
INES

Sombra
Globally distributed and Highly available
Why capacity
planning?
Or a journey of discovery and ingenuity
The views reflected in this talk
are not to be considered a
reflection of the skills of my
coworkers who are extremely
nice human beings and way
better at capacity planning
than I am.
😜
NOT A monitoring
person
💀
🚨🚨
INSTRUMENT
MONITOR &
ALERT
PLAN
&
PREDICT
The Road to Capacity planning
?
Findings
Books
0
Day One
Some Learning
Our Discoveries
Rituals
& Myths
Asking Around
Bringing it Home
our Path today
Checking The
Edge
zero… Oh shit!
a convenient “situation”
Handles State
Many Clients
Other systems depend on this service to be: up, healthy, and available!
A bit F*cked
Our 

World
Edge Core✨ ✨
a Fastly POP
I Rule the
Edge!
Evaluates weekly global
POPs performance &
makes projections
Publishes capacity
performance report in
clear location
Plans for our physical
capacity & transit
capacity
Meet Catharine
Planning Our Capacity
Some metrics
- Network Capacity (Gb) 

- Ordered Network Capability (Gb) 

- Planned Network Capacity (Gb)

- RPS Capacity (k) 

- Network peak (Gb) 

- RPS peak (k) 

- Site CPU Peak (%) 

- Network Utilization (%)
Over 30%: flagged, Over 70%:
Red status
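
A minimal sketch of how the thresholds above could be applied. It assumes utilization is network peak divided by network capacity (the slide does not state the formula), and the site figures and the "ok" label are hypothetical:

```python
def network_utilization(peak_gb: float, capacity_gb: float) -> float:
    """Return utilization as a percentage of capacity (assumed formula)."""
    if capacity_gb <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * peak_gb / capacity_gb


def status(utilization_pct: float) -> str:
    """Map utilization to the slide's statuses: over 30% flagged, over 70% red."""
    if utilization_pct > 70:
        return "red"
    if utilization_pct > 30:
        return "flagged"
    return "ok"  # hypothetical label for anything at or below 30%


# Hypothetical POPs: (network peak in Gb, network capacity in Gb)
sites = {"LON": (25, 100), "DFW": (45, 100), "SYD": (80, 100)}
report = {name: status(network_utilization(peak, cap))
          for name, (peak, cap) in sites.items()}
print(report)
```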
Edge Insights
Our ability to correctly plan for
capacity is critical to our
bottom line
Capacity doesn’t just involve
hardware; software
optimizations matter
People affect capacity
Hitting
The
Books
Defining Capacity planning
Measuring, planning, & managing system growth
Determines what your system needs & when
From the observation of actual traffic. Use current
performance as baseline.
Must happen regardless of what you might
optimize
We have to be this fast & reliable:
X per second & Y% Uptime
⇒ MEASURE HOW FAST/RELIABLE WE ARE
⇒ ARE WE FAST/RELIABLE ENOUGH RIGHT NOW?
No! ⇒ CHANGE / ADD / REMOVE: HARDWARE, SOFTWARE, ARCHITECTURE
Yes! ⇒ FIGURE OUT HOW TO STAY FAST/RELIABLE ENOUGH
Allspaw's Wisdom
From The Art of Capacity Planning
👈
System’s Ceiling: critical level of a
resource that cannot be crossed
without failure. Find yours
Another form of Capacity Planning:
Controlled load testing
Predictions: ceilings + historical data
Allspaw's Wisdom
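
"Predictions: ceilings + historical data" can be sketched as a straight-line fit over past peaks, extrapolated to the ceiling. This is only one possible approach, not the method from the book or the talk. The weekly peaks and the ceiling value are hypothetical, and any such projection is a guess:

```python
def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...; returns (slope, intercept)."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until_ceiling(weekly_peaks, ceiling):
    """Weeks after the last sample until the fitted line reaches the ceiling.

    Returns None when usage is flat or shrinking, 0.0 when already past it.
    """
    slope, intercept = fit_line(weekly_peaks)
    if slope <= 0:
        return None
    crossing_week = (ceiling - intercept) / slope
    return max(0.0, crossing_week - (len(weekly_peaks) - 1))


# Hypothetical weekly RAM peaks (GB) and a ceiling found by load testing (GB)
peaks = [40, 42, 44, 46, 48]
print(weeks_until_ceiling(peaks, ceiling=64))
```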
Allspaw's Wisdom
System architecture can affect your
ability to add capacity
Identify & track your application’s
metrics
Tying metrics to user behavior is helpful
If you don’t have ways to measure
your current capacity you can’t plan
Little’s Law & Capacity planning
L = λW
Capacity (L), Throughput (λ),
and Latency (W)
Applies to stable systems
Use this information to better
understand our workload and to
define constraints
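
A small worked example of L = λW, using the slide's reading of the symbols. The request rate, latency, and concurrency limit are hypothetical numbers, and the law only holds for a stable system measured over a long enough window:

```python
def in_flight(throughput_rps: float, latency_s: float) -> float:
    """L = λW: average number of requests in the system at once."""
    return throughput_rps * latency_s


def max_throughput(concurrency_limit: float, latency_s: float) -> float:
    """λ = L / W: throughput a fixed concurrency limit can sustain."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return concurrency_limit / latency_s


# Hypothetical workload: 400 requests/s, each taking 0.25 s on average
concurrent = in_flight(400, 0.25)

# Hypothetical constraint: at most 100 concurrent requests.
# If latency degrades to 0.5 s, sustainable throughput halves.
degraded_rps = max_throughput(100, 0.5)
print(concurrent, degraded_rps)
```

Read this way, a latency regression becomes a capacity problem even when no hardware changes.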
Literature Insights
Possible to have plenty of capacity and
a slow site nonetheless
Projections & curve fitting are guesses
Keep track of API calls & their rate
Always gonna be spikes & hiccups.
Take the bad with the good & plan for it
Rituals
&
Myths
Crowdsourcing Capacity planning
Industry Insights
Hard to extrapolate general
advice into something
applicable for my situation
Simplicity & ability to reason are
the only things I could trust
Confusing community stance on
the ROI of capacity planning
Findings
& Putting things in practice
steps followed
Step One: Discovery
Documented system architecture & request lifecycle
Formalized: clients, SLAs, & operational requirements
Step Two: Gauging & Planning
Confirmed constraints & determined strategy
Parallelized capacity & optimization tasks
Organized a team
REQUEST flow (architecture diagram)
Diagram labels: Edge, Core, APP / API (x2), LB (x2), COORDINATOR A, COORDINATOR B, COORDINATOR C, CACHE LON, CACHE DFW, CACHE FRA, CACHE LAX, CACHE AMS, CACHE SYD
steps followed
Step Three: Capacity expansion
Doubled RAM: our constrained resource
Horizontally scaled to 3 servers + 1 canary
Step Four: re-Evaluation
Start process again
Tons of tuning left to do. We know we have suboptimal configs!
System Before
System After
Unexpected Challenges
Our goal when adding capacity
was no service disruption.
Localhost is the goddamn devil
Gap from metric/graph to
insight can be huge
Slowness is the nemesis of
distributed systems
The Oprah Problem
Developing operational
insights into non-owned
system under pressure is
not great
Use playbooks,
debug.md, rotations, &
rollout owners
Proactivity and clarity
are your best tools
Everyone
gets more
capacity!
Some Insights
Anything API-driven ought to
carry a rate limit - We can
easily DDoS ourselves!
Monitor and alert on
expensive API actions
Mind your system
dependencies: practice
defensive system design &
architecture
CAPACITY
PLANNING
ALERTING
MONITORING
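
One common way to put a rate limit on API calls is a token bucket. The sketch below is illustrative only and is not the limiter used in the talk. The class name, parameters, and numbers are hypothetical, and the `cost` argument is one assumed way to charge expensive API actions more:

```python
import time


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock          # injectable so the demo is deterministic
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens and return True if the call fits the budget."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Deterministic demo with a fake clock
fake_now = [0.0]
bucket = TokenBucket(rate=2, burst=3, clock=lambda: fake_now[0])
results = [bucket.allow() for _ in range(4)]  # burst of 3 passes, 4th is rejected
fake_now[0] += 1.0                            # one second later, 2 tokens are back
results.append(bucket.allow())
print(results)
```

Counting the rejected calls gives a natural signal to monitor and alert on.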
Some Findings
Capacity tied to murky
organizational structure
is both good & bad
(but mostly bad)
Mind your error
descriptions! Cheeky
today ⇒ misleading
tomorrow!
Finding my system’s ceiling is still tricky
Services owned by engineers means
you need to level up on Ops skills
Back to re-evaluate setup to get more
out of this new capacity
Performance testing ought to be done
on the core’s side (& edge)
My Insights
TL;DR
Capacity planning
Is a process, not a one-time event
Pushes you to better understand your system, its capacity & its boundaries - that is good!
Proactivity is best
Distributed systems
Request lifecycle gets tricky
System boundaries, dependencies & SLAs must be discussed
Your system’s capacity may bound other systems’ capacity
github.com/Randommood/ZerotoCapacityPlanning
Special Thanks to: Catharine Strauss,
Alan Kasindorf, Matt Whiteley,
Caitie McCaffrey, Thom Mahoney,
Mike O’Neill, Devon O’Dell,
Katherine Daniels, Nathan Taylor,
Bruce Spang, and Greg Bako
Thank you!
