Capacity Management and Planning:
Data Science, Queueing, Optimization
and other good things
Alex Gilgur, PhD
Principal Data Scientist
Abstract
Capacity Management (CapMan) and Planning (CapPlan) are two sides of the same coin –
ensuring that we can deliver the best experience to the customer at the lowest cost to us.
It is traditionally the most operations-research-heavy (OR-heavy) side of any technical
infrastructure.
This talk dives into capacity management and planning, why we do it, and how we do it.
Content
1. CapPlan and CapMan
a. Why Manage Capacity
b. What’s the difference between Planning and Management?
c. Three Steps of Capacity Planning
2. Capacity Management == Supply-Chain Management
a. Supply-Side Capacity
b. Demand-Side Capacity: Requirements
c. Utilization: Pros and Cons
3. Statistical Process Control
a. Setting Specifications
b. SPC
4. Queueing Fundamentals
a. Palm-Khintchine Theorem
b. Little’s Law
c. Network Queueing
5. Supply ≥ Demand
a. Optimization
6. Forecasting for Capacity Planning
a. Motivation & Methods
b. Methods: a 30,000-ft View
c. Ensemble Forecasting: Prediction Intervals
d. Model Monitoring
7. What we Covered
Why Manage Capacity
1. Capacity is expensive
2. Capacity takes time to deploy
3. Capacity => QoS => Customer Retention => Revenue
(Source)
Capacity Management in the “Big Picture” of Service Delivery
Capacity Management:
1. Determine Service Level Requirements (SLAs & SLOs): From SLAs to SLOs
2. Analyze Current Capacity: Simulation Optimization
3. Plan for the Future: Forecasting
(Source)
Why Manage Capacity?
(Source)
Causes of Bullwhip Effect
Demo
Supply-Chain Management and Capacity Management
Supply-Chain Management is a 5-stage process with a feedback loop. Defective products tell us how well we are performing.
The process is fractal: at each stage, we follow the same virtuous circle of
Monitor => Analyze => Model => Optimize => Change => Monitor
Similarly, Capacity Management is a 5-stage process with a feedback loop.
We monitor performance as the signal for adding or reducing capacity - both in real time and in long-range plans.
At each stage, we do fault and performance monitoring and control, along with root-cause analysis (RCA).
[Diagram labels, Stages of CAPACITY MANAGEMENT: Performance Monitoring; Planning; Supply-Side Capacity; Payload; Network; Demand-Side: Planning]
Computing Capacity Requirements: Utilization: Pros and Cons
Utilization is a misleading metric for capacity planning & management
=> we have to use Queueing Models - closed-form equations or simulation
(Source)
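A minimal sketch of why utilization alone misleads, assuming the simplest closed-form queueing model (M/M/1: Poisson arrivals, exponential service times, one server). The choice of model and the name `mm1_time_in_system` are assumptions for illustration, not taken from the slide. Mean time in system is S / (1 - ρ), so delay grows nonlinearly as utilization approaches 1.

```python
def mm1_time_in_system(service_time: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time / (1.0 - utilization)


# Delay expressed in multiples of the mean service time (service_time = 1).
for rho in (0.50, 0.90, 0.99):
    w = mm1_time_in_system(service_time=1.0, utilization=rho)
    print(f"utilization={rho:.2f}  time in system={w:.1f} service times")
```

Going from 50% to 90% utilization multiplies the delay by five in this model, which is why a utilization threshold by itself says little about the customer experience.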
Demand Side: Capacity Requirements
[Figure: Demand and Headroom at the service level, and Demand and Headroom for each of three low-level components]
Computing Capacity Requirements
[Flowchart labels: Compute Demand; Compute Headroom; Estimate Capacity; All Good; Need More; ID Anomalies in Demand; ID Anomalies in Capacity]
Erlang models are how it all started.
They make assumptions about throughput and service times that do not always hold true.
More general closed-form models exist nowadays.
Capacity Visibility is the key to successful capacity management
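For reference, a sketch of the classic Erlang B (loss) formula, which assumes Poisson arrivals and that blocked requests are dropped rather than queued. The function names `erlang_b` and `servers_needed` and the example numbers are hypothetical; the recursion itself is the standard numerically stable form.

```python
def erlang_b(offered_erlangs: float, servers: int) -> float:
    """Blocking probability for `servers` channels offered `offered_erlangs` of load."""
    b = 1.0  # B(A, 0) = 1: with no servers everything is blocked
    for n in range(1, servers + 1):
        b = offered_erlangs * b / (n + offered_erlangs * b)
    return b


def servers_needed(offered_erlangs: float, max_blocking: float) -> int:
    """Smallest number of servers that keeps blocking at or below the target."""
    n = 1
    while erlang_b(offered_erlangs, n) > max_blocking:
        n += 1
    return n


# Hypothetical example: 10 erlangs of offered load, at most 1% blocking.
print(servers_needed(10.0, 0.01))
```

When arrivals are bursty or service times are heavy-tailed, these assumptions break, which is the point made above about needing more general models or simulation.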
Capacity Visibility. Statistical Process Control: What is SPC?
This is NOT Statistical Process Control.
And it is NOT Monitoring either.
This is Statistical Process Control:
● Upstream & Downstream Dependencies Modeled
● Metric Defined
○ Anomalies Defined
○ Specification Limits Set
● Control Limits Computed & Tracked
● Methods to Adjust the Processes Exist
(Source)
Foundations of Statistical Process Control
Definitions
● QoE - Quality of Experience
● QoS - Quality of Service
● Metric - a tuple of:
○ Name,
○ Source,
○ Aggregation level
○ Specifications (aka SLO)
■ LSL (Lower Spec Limit)
■ Tgt (Target Value)
■ USL (Upper Spec Limit)
● SLA - Service Level Agreement - our contractual obligations to the customers
● Control Limits - bounds of the stationary distribution of the metric - a tuple of
○ LCL (Lower Control Limit),
○ μ (Central Value),
○ UCL (Upper Control Limit)
QoE Metric Examples:
● Throughput
● Latency
● Pixelation
● ... ... ...
QoS Metric Examples:
● Throughput
● Latency
● Jitter
● Packet Loss
● Device Temperature
● CPU Utilization
● Queue Fill Factor
● ... ... ...
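The definitions above map naturally onto small record types. This is a sketch only: the class and field names (`Spec`, `Metric`, `ControlLimits`, `lower`, `upper`) and the example values are hypothetical, chosen to mirror the tuples listed in the definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Specification limits (the SLO): what the business requires."""
    lsl: float
    target: float
    usl: float


@dataclass(frozen=True)
class ControlLimits:
    """Control limits: what the process actually does when it is stable."""
    lower: float
    center: float
    upper: float

    def in_control(self, value: float) -> bool:
        return self.lower <= value <= self.upper


@dataclass(frozen=True)
class Metric:
    name: str
    source: str
    aggregation: str
    spec: Spec

    def in_spec(self, value: float) -> bool:
        return self.spec.lsl <= value <= self.spec.usl


# Hypothetical latency metric and limits, for illustration only.
latency = Metric("latency_ms", "edge-router telemetry", "p95 over 5 min",
                 Spec(lsl=0.0, target=20.0, usl=50.0))
limits = ControlLimits(lower=12.0, center=18.0, upper=24.0)

observed = 30.0
print(latency.in_spec(observed), limits.in_control(observed))
```

The example value is within spec but outside the control limits: the customer is not hurt yet, but the process has changed, which is the distinction between specifications and control limits.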
https://edmondbusiness.com/2021/02/gentlemen-this-is-a-football/
SPC: Defining Anomalies
Rules-Based Statistical
https://en.wikipedia.org/wiki/Nelson_rules
https://en.wikipedia.org/wiki/Western_Electric_rules
How Significant are these Anomalies?
We need SPECIFICATIONS
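A sketch of two rules-based detectors in the spirit of the Western Electric rules linked above: a point beyond 3 sigma from the center line, and a run of 8 consecutive points on the same side of it. The function names, the run length default, and the sample data are assumptions for illustration; the center line and sigma are taken as already estimated from a stable period.

```python
def beyond_three_sigma(values, center, sigma):
    """Indices of points more than 3 sigma away from the center line."""
    return [i for i, x in enumerate(values) if abs(x - center) > 3 * sigma]


def run_on_one_side(values, center, run=8):
    """Indices at which `run` or more consecutive points sit on one side of center."""
    hits, count, side = [], 0, 0
    for i, x in enumerate(values):
        s = (x > center) - (x < center)  # +1 above, -1 below, 0 exactly on the line
        if s != 0 and s == side:
            count += 1
        else:
            side, count = s, (1 if s != 0 else 0)
        if count >= run:
            hits.append(i)
    return hits


# Hypothetical metric samples with center line 10 and sigma 1.
samples = [10, 10.5, 9.5, 14, 10.2, 10.3, 10.1, 10.4, 10.2, 10.3, 10.1, 10.2]
spikes = beyond_three_sigma(samples, center=10, sigma=1)
shifts = run_on_one_side(samples, center=10)
print(spikes, shifts)
```

The first rule catches the spike; the second catches the small sustained shift that no single point would flag. Whether either matters to the business is exactly the question that specifications answer.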
SPC: Setting Specifications: High-Level Workflow
Models:
● Revenue = f(UX)
● Throughput = f(...)
● Latency = f(...)
Constraints:
● Latency >= X msec
● Throughput <= Y Mbps
● Availability <= P%
UX = F(Throughput, Latency, ...)
Stakeholders
Are these numbers good?
Throughput Latency
UX
SPC: Setting Specifications: a 30,000-ft View
From SLAs to SLOs
SPC: Setting Specifications: Details
SPC: Tracking Control Limits: SPC Measures
Z_lt = Z_st - 1.5 (the conventional 1.5-sigma long-term shift); C_pk = Z_st / 3
Z_st = 2: Z_lt = 0.5: C_pk = 0.67: about 310,000 defects per 1,000,000 opportunities (69.1% yield)
Z_st = 3: Z_lt = 1.5: C_pk = 1.0: about 67,000 defects per 1,000,000 opportunities (93.3% yield)
Z_st = 6: Z_lt = 4.5: C_pk = 2.0: 3.4 defects per 1,000,000 opportunities (99.99966% yield)
Source:
https://www.six-sigma-material.com/Tables.html
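The figures above can be approximately reproduced from the standard normal distribution. This is a sketch assuming a one-sided defect definition and the conventional 1.5-sigma shift between short-term and long-term Z; the function names are hypothetical.

```python
from statistics import NormalDist

SHIFT = 1.5  # conventional short-term to long-term sigma shift


def long_term_z(z_st: float, shift: float = SHIFT) -> float:
    return z_st - shift


def dpmo(z_st: float, shift: float = SHIFT) -> float:
    """Defects per million opportunities: one-sided tail beyond the long-term Z."""
    return (1.0 - NormalDist().cdf(long_term_z(z_st, shift))) * 1_000_000


def cpk(z_st: float) -> float:
    """Process capability index implied by the short-term Z."""
    return z_st / 3.0


for z in (2.0, 3.0, 6.0):
    d = dpmo(z)
    print(f"Z_st={z:.0f}  Z_lt={long_term_z(z):.1f}  Cpk={cpk(z):.2f}  "
          f"DPMO={d:,.1f}  yield={100 - d / 10_000:.5f}%")
```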
SPC: Methods to Adjust the Processes
Causal Inference
● We have the model built in setting the specs
● We know which knobs are important
● We can play what-if scenarios to keep the target
metrics in specs and in control
● We generate feedback to control mechanisms
Bayesian Inference
● We do not have the model
● We do not know which knobs are important
● We apply Bayesian inference to the observed statistical
distributions to infer which knobs are
the likely culprits of the problem
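A toy illustration of the Bayesian route, assuming we can put a prior on each candidate knob and estimate how likely the observed anomaly is under each one. The knob names and all probabilities are invented for the example; the update itself is plain Bayes' rule.

```python
def posterior(priors: dict, likelihoods: dict) -> dict:
    """P(knob | anomaly) from P(knob) and P(anomaly | knob) via Bayes' rule."""
    joint = {k: priors[k] * likelihoods[k] for k in priors}
    evidence = sum(joint.values())
    return {k: v / evidence for k, v in joint.items()}


# Hypothetical candidate causes of a latency anomaly.
priors = {"link_flap": 0.2, "cpu_saturation": 0.3, "routing_change": 0.5}
likelihoods = {"link_flap": 0.9, "cpu_saturation": 0.2, "routing_change": 0.1}

ranked = sorted(posterior(priors, likelihoods).items(), key=lambda kv: -kv[1])
for knob, p in ranked:
    print(f"{knob}: {p:.2f}")
```

Even though the routing change had the highest prior in this made-up example, the evidence moves most of the probability to the link flap.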
Statistical Process Control Workflow
[Diagram labels: Statistical Process Control (SPC); Fault Detection, Identification, & Recovery (FDIR); Feedback / Feedforward Control; Availability & Resilience Evaluation; Persisted Data; Data Streaming Bus; BizOps; Specifications]
Sampling for SPC
[Figure: Information vs. BW Usage]
Problem Statement (two alternative formulations):
● Maximize information (minimize entropy) subject to BW constraints
● Maximize information (minimize entropy) while minimizing BW usage
Entropy, Solution, Example: shown as figures on the slide
Queueing Fundamentals
What’s happening here?
In 2008, Nagoya University researchers put cars on a circular track and
asked all drivers to maintain a constant speed of 30 km/h. After
a few laps, local traffic-jam shockwaves started to appear.
The shockwaves traveled backward at about 20 km/h.
● Constant utilization
● Plenty of room between the cars
● Random variations in speed along the track cause congestion.
(More to Explore)
Palm-Khintchine Theorem
The superposition of a large number of independent renewal processes
converges to a Poisson process, regardless of their
individual distributions.
(The code is here)
Demo: 25 sets of 2000 random numbers drawn from Gaussian, Exponential, Gamma, and Uniform distributions, and the sum of the 25 sets.
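A small simulation sketch of the theorem, not the author's linked code. It assumes 50 independent sources whose inter-arrival gaps are uniform on [0.5, 1.5] (clearly not exponential), merges their arrival streams, and compares the coefficient of variation (CV) of the gaps. For a Poisson process the gaps are exponential and the CV is 1. The source count, gap distribution, horizon, and seed are arbitrary choices.

```python
import random
import statistics


def renewal_arrivals(rng, horizon, draw_gap):
    """Arrival times of one renewal process on [0, horizon)."""
    times = []
    t = rng.random() * draw_gap()  # random phase so the sources are not synchronized
    while t < horizon:
        times.append(t)
        t += draw_gap()
    return times


def gap_cv(times):
    """Coefficient of variation of inter-arrival gaps (1.0 for a Poisson process)."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    return statistics.pstdev(gaps) / statistics.fmean(gaps)


rng = random.Random(42)


def uniform_gap():
    return rng.uniform(0.5, 1.5)


single = renewal_arrivals(rng, 200.0, uniform_gap)
merged = sorted(t for _ in range(50)
                for t in renewal_arrivals(rng, 200.0, uniform_gap))

cv_single = gap_cv(single)
cv_merged = gap_cv(merged)
print(f"one source: CV={cv_single:.2f}; 50 merged sources: CV={cv_merged:.2f}")
```

One source looks nothing like Poisson traffic, while the merged stream has a gap CV close to 1, which is what lets us use Poisson-arrival queueing formulas on aggregated demand.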
Why is Palm-Khintchine Theorem Important?
1. It is a limit theorem
2. It explains statistical multiplexing
3. It justifies closed-form solution to queueing models - when
we have many demands converging into one.
4. If we converge the throughput from a node’s incoming
links, we may or may not have congestion. This depends
on the latency - time that packets spend on the node in
question.
Node 5 is a single point of failure (SPOF), but it may
not be congested if it is
properly sized, such that the
number of packets coming in
does not exceed the number of
packets in the queue plus the
number of packets going out.
[Figure: network graph with node 5 as the convergence point]
Why is Number of Packets, not Packet Rate, Important?
It is the packets that get stuck
Little’s Law
Capacity: we want to let the WIP all go through
But we need to know the queueing delay => “It’s complicated”
In network operations, nominal delay
is also known as propagation delay
[Figure labels: Demand; Capacity]
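Little's Law in its usual form is L = λW: the average number of items in the system (WIP) equals the average arrival rate times the average time spent in the system. A minimal sketch for packets on a link follows; the rate and delay values are hypothetical.

```python
def wip(throughput_pps: float, delay_s: float) -> float:
    """Little's Law: average packets in flight = arrival rate x time in system."""
    return throughput_pps * delay_s


throughput_pps = 50_000        # hypothetical packet arrival rate
propagation_delay_s = 0.002    # nominal (propagation) delay
queueing_delay_s = 0.0005      # extra delay from queueing; the hard part to predict

nominal_wip = wip(throughput_pps, propagation_delay_s)
total_wip = wip(throughput_pps, propagation_delay_s + queueing_delay_s)
print(f"nominal WIP: {nominal_wip:.0f} packets; with queueing: {total_wip:.0f} packets")
```

The capacity has to hold the total WIP, not just the nominal part, and the queueing term depends on utilization and variability rather than on the packet rate alone.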
Network Path is a Markov Chain of Queues
[Diagram: network path A, B, C, ..., Z]
Jitter (latency variation) is the “silent killer” of QoS
Delay at the previous node
contributes to inter-arrival
time (IAT) at the current node
● As packets move down the network path, they experience propagation and queueing delay.
● As we saw in the Nagoya experiment, there will be variability in WIP.
● Little’s Law describes it for stationary systems.
● For non-stationary systems, we can use the differential form of Little’s law (equation shown on the slide).
[Figure: queueing delay as a function of utilization (normalized
throughput) and number of parallel channels (servers)] (Source)
It gets complicated fast
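One closed-form way to get queueing delay as a function of utilization and number of parallel servers is the Erlang C (M/M/c) model. This is a sketch under that model's assumptions (Poisson arrivals, exponential service, identical servers, infinite queue), not necessarily the model behind the figure; the function names and numbers are hypothetical.

```python
def erlang_b(offered: float, servers: int) -> float:
    """Erlang B blocking probability, computed by the stable recursion."""
    b = 1.0
    for n in range(1, servers + 1):
        b = offered * b / (n + offered * b)
    return b


def erlang_c(offered: float, servers: int) -> float:
    """Probability that an arrival has to wait in an M/M/c queue."""
    if offered >= servers:
        raise ValueError("offered load must be below the number of servers")
    b = erlang_b(offered, servers)
    return b / (1.0 - (offered / servers) * (1.0 - b))


def mean_wait(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Mean time spent waiting in queue (excluding service) for M/M/c."""
    offered = arrival_rate / service_rate
    return erlang_c(offered, servers) / (servers * service_rate - arrival_rate)


# Same 80% utilization, one server versus two pooled servers.
one_server = mean_wait(arrival_rate=0.8, service_rate=1.0, servers=1)
two_servers = mean_wait(arrival_rate=1.6, service_rate=1.0, servers=2)
print(f"1 server: {one_server:.2f}; 2 servers: {two_servers:.2f} (mean wait, service times)")
```

At equal utilization the pooled system waits less, so utilization and channel count have to be considered together.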
Network Path in a Network of Queues
[Diagram: network path A, B, C, ..., Z, with an additional source A’; labels: Statmuxing, Queueing, Routing]
Now the WIP at node i (here i == C) is the sum of the WIPs
coming in and the queueing WIP on node C, which in turn is a
function of the total throughput going through node C.
[Figure: queueing delay as a function of utilization (normalized
throughput) and number of parallel channels (servers)] (Source)
We have to account for statmuxing, queueing, and routing.
Why do we Optimize Capacity?
1. Demand should NOT Exceed Supply -- or else:
a. Loss of SLAs
b. Loss of customers
c. No Room for Events (e.g., NYE; Cyber Monday; Olympics / World Cup / Super Bowl; Burning Man; ...)
2. Constraints:
a. Budget
i. Capacity is expensive to very expensive
b. Operations:
i. Statmuxing
ii. Queueing
iii. Routing
c. Performance:
i. Stochastic demand
ii. Jitter
iii. Reliability of Subsystems and Components
3. We want to:
a. Select the right Objective (Cost or Utility) Function
b. Correctly list all Constraints
c. Use the right Solver
d. Expect the unexpected
In the interest of time, we are not covering optimization
techniques in this talk. An incomplete list of traditional
capacity-management methods includes: Bin Packing; Simplex;
Dijkstra SPF (OSPF / CSPF); Genetic Optimization; and others.
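As one illustration of the first method in that list, a sketch of bin packing with the first-fit-decreasing heuristic: place demands, largest first, onto the first link (bin) that still has room. The heuristic choice, the function name, and the demand values are assumptions for the example; the talk itself does not prescribe a specific algorithm.

```python
def first_fit_decreasing(demands, capacity):
    """Pack demands into as few equal-capacity bins as this heuristic finds."""
    bins = []
    for d in sorted(demands, reverse=True):
        if d > capacity:
            raise ValueError(f"demand {d} exceeds bin capacity {capacity}")
        for b in bins:
            if sum(b) + d <= capacity:
                b.append(d)  # fits on an existing bin
                break
        else:
            bins.append([d])  # no existing bin has room: open a new one
    return bins


# Hypothetical demands in Gbps packed onto 10 Gbps links.
packing = first_fit_decreasing([4, 8, 1, 4, 2, 1], capacity=10)
print(len(packing), packing)
```

First-fit decreasing is a heuristic, not an exact solver, and a real plan would also reserve headroom on each link rather than fill it to 100%.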
Forecasting for Capacity Planning: Motivation & Approaches
1. Strategic forecasting - Long-Range Plans (LRPs): high aggregation; low precision; upper bound of demand; lower bound of supply
a. What will the demand be X years from now?
b. What will the supply be X years from now?
2. Tactical forecasting - Mid-Range (usually PoR, Plan of Record): high aggregation; medium precision; range of demand; range of supply
a. What will the demand be Y months / quarters from now?
b. What will the supply be Y months / quarters from now?
3. Operational forecasting - Short-Range: low aggregation; high precision; range of demand; range of supply
a. What will the demand be Z days / hours / minutes from now?
b. What will the supply be Z days / hours / minutes from now?
[Diagram labels: Individual Entity Demand; Prediction Interval Boundaries; Sum of the Demand; Probabilistic Forecast; Hyperparameter Tuning; Monte-Carlo]
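A Monte-Carlo sketch of going from individual entity demand to a probabilistic forecast of the sum. It assumes five independent entities with lognormal demand; the distribution, its parameters, the seed, and the number of simulations are all hypothetical. The comparison shows why adding up individual interval boundaries is not the same as the interval of the total.

```python
import math
import random

rng = random.Random(7)
Z = 1.96
N_SIMS = 20_000

# (mu, sigma) of log-demand for each entity; hypothetical values.
entities = [(math.log(100.0), 0.3)] * 5

# Naive approach: add up each entity's own 95% interval boundaries.
sum_of_lowers = sum(math.exp(mu - Z * s) for mu, s in entities)
sum_of_uppers = sum(math.exp(mu + Z * s) for mu, s in entities)

# Monte-Carlo: simulate the total demand and read off its quantiles.
totals = sorted(
    sum(rng.lognormvariate(mu, s) for mu, s in entities) for _ in range(N_SIMS)
)
lo = totals[int(0.025 * N_SIMS)]
hi = totals[int(0.975 * N_SIMS)]

print(f"sum of individual bounds: [{sum_of_lowers:.0f}, {sum_of_uppers:.0f}]")
print(f"Monte-Carlo 95% interval of the sum: [{lo:.0f}, {hi:.0f}]")
```

With independent entities the simulated interval is noticeably narrower than the sum of the individual bounds, because extreme values rarely coincide. Correlated demand would widen it again, so the independence assumption deserves a check on real data.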
Forecasting Methods: a 30,000-ft View
(Source)
● Model-Based (Causal):
○ We know the causal variables
○ We have causal variables’ forecasts
○ We can build a causal model (ML or Simulation)
● Time-Series Analysis:
○ We assume that previous behavior will continue
○ We don’t have explanatory metrics for target
variable
● Ensemble:
○ “A collection of weaker models combined will yield
a more powerful model.” (Source)
○ Watch out for Prediction Intervals!!!
NEVER TRUST A POINT FORECAST!!!
Ensemble Forecasting: Prediction Intervals
(Source)
Watch out for Prediction Intervals!!!
For non-Gaussian distributions, the ratio of the 95% interval half-width to the standard deviation will not be 1.96.
Compute it on the dataset - unless it is a Cauchy or Power-Law
distribution. Then all bets are off - unless we use Monte-Carlo.
Assumption: all model predictions in the ensemble are mutually independent
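A sketch of computing that ratio on the data rather than assuming 1.96. It estimates the multiplier k such that mean ± k·sd covers 95% of a sample, using synthetic Gaussian and Laplace samples as stand-ins for forecast errors. The function name, sample sizes, and seed are hypothetical, and a symmetric interval is assumed; skewed data would need separate lower and upper quantiles.

```python
import random
import statistics


def empirical_multiplier(sample, coverage=0.95):
    """Smallest k (approximately) such that mean +/- k*sd covers `coverage` of the sample."""
    mu = statistics.fmean(sample)
    sd = statistics.pstdev(sample)
    scaled = sorted(abs(x - mu) / sd for x in sample)
    idx = min(int(coverage * len(scaled)), len(scaled) - 1)
    return scaled[idx]


rng = random.Random(1)
n = 50_000
gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Difference of two exponentials is Laplace-distributed (heavier tails than Gaussian).
laplace = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)]

k_gauss = empirical_multiplier(gaussian)
k_laplace = empirical_multiplier(laplace)
print(f"Gaussian k={k_gauss:.2f}; Laplace k={k_laplace:.2f}")
```

For Cauchy or power-law data the standard deviation itself is unstable or undefined, so this multiplier is not meaningful there; quantiles taken directly from Monte-Carlo draws are the fallback.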
Forecasting: Model Monitoring
(Source)
1. Start
2. Collect data; train model; generate forecast
3. Collect N < Horizon data points
4. Compute forecast quality metrics on new data
5. SPC
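A sketch of the last two steps: compute a forecast quality metric on newly collected points and feed it into SPC-style control limits. Absolute percentage error is used as the metric and mean + 3 sigma of a baseline period as the upper control limit; both choices, the function names, and all numbers are assumptions for illustration.

```python
from statistics import fmean, pstdev


def ape(actual: float, forecast: float) -> float:
    """Absolute percentage error of one forecast point."""
    return abs(actual - forecast) / abs(actual)


def upper_control_limit(baseline_errors, k: float = 3.0) -> float:
    """Center line plus k standard deviations of the baseline errors."""
    return fmean(baseline_errors) + k * pstdev(baseline_errors)


# Errors observed while the model was known to be healthy (hypothetical).
baseline = [0.04, 0.05, 0.06, 0.05, 0.04, 0.06, 0.05, 0.05]
ucl = upper_control_limit(baseline)

# Newly collected actuals versus what the model had forecast (hypothetical).
new_actuals = [100, 104, 120]
new_forecasts = [98, 101, 100]

flags = [ape(a, f) > ucl for a, f in zip(new_actuals, new_forecasts)]
print(f"UCL={ucl:.3f}  out-of-control flags={flags}")
```

A flagged point is a signal to investigate, and possibly to retrain the model, rather than proof that the model is broken.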
What we Covered
1. Capacity Is Expensive and Complicated. It Requires Careful Planning and Management
2. Supply-Chain management principles apply to Capacity Management
3. Statistical Process Control > Monitoring > Dashboarding
4. Queueing Math Works and is Useful
a. Statmuxing
b. Queueing
c. Routing
5. We Optimize Operations and Capacity to keep Supply ≥ Demand
6. Forecasting is a Critical Element of Capacity Planning and Management
Key Takeaways:
1. Forecasting, Queueing, Statistical Process Control, and Optimization are Key Elements of Capacity Planning and Capacity Management
2. Local Capacity Management is Meaningless -- Need End to End and Across Layers
3. Capacity Management and Planning rely on models. For models, GIGO (garbage in, garbage out) is the guiding principle => for CapMan & CapPlan GIGO holds true.
Thank you!!!
alexgilgur@gmail.com
https://www.linkedin.com/in/alexgilgur/
Google Scholar page
Appendix
Capacity Requirements: Back in Time
How it all started
(Source)
