1. Different Approaches for Different Tasks
CMG imPACt 2016
42nd International Conference by Computer Measurement Group
Alexander Gilgur, Steve Politis
Session 361
November 08, 2016
La Jolla, CA, USA
2. Steve Politis
Josue Kuri
Alex Nikolaidis
Grace Smith
Yuri Smirnov
Tyler Price
Paul Sorenson
For contributing knowledge, ideas, solutions, and support
3. What's the difference between Performance and Capacity?
• Two languages in the IT world
• Need tools, metrics, and stats compatible with both languages
• Need fluency in both languages
4. • Metrics:
• Time (latency)
• Rate
• Count:
• Packets in flight
• Packet loss
• A few words about % Utilization
• Models:
• Correlations
• Trends in Data
• Time-Series Analysis
• Approach:
• Top-Down?
• Bottom-Up?
• Hybrid?
• Measures and Aggregations:
• μ ± k·σ
• Busy Hour / Peak Minute
• Nonparametric Measures:
• P95
• Outlier Boundaries
5. For Capacity Planning:
• Uncertainty: "Redistribution of wealth"
• Distribution looks Gaussian
• Washout of local anomalies
For Performance Analysis:
• "Big Picture"
• Immediate impact assessment
• Drilldown is easier than aggregation:
• No need to worry about which aggregation function to choose
6. For Performance Analysis:
• Immediate anomaly detection
• Trend identification
• Practical significance is unknown
For Capacity Planning:
• "Just the right" bandwidth
• Actual distributions & trends
• Time consuming
• Aggregation can get complicated
7. • Aggregate A-Z Pairs ("Flows") for each service (product)
• Aggregate services (products) for each A-Z Pair
For Infra Performance Analysis:
• Cannot tell where the "hot" issues are
For Capacity Planning:
• Can tell how much each svc (product) needs
For Infra Performance Analysis:
• Will ID "hot" flows (A-Z Pairs)
For Capacity Planning:
• Will not find "hot" services (products)
8. • Metrics:
• Time (latency)
• Rate
• Count:
• Packets in flight
• Packet loss
• A few words about % Utilization
• Models:
• Correlations
• Trends in Data
• Time-Series Analysis
• Approach:
• Top-Down?
• Bottom-Up?
• Hybrid?
• Measures and Aggregations:
• μ ± k·σ
• Busy Hour / Peak Minute
• Nonparametric Measures:
• P95
• Outlier Boundaries
9. [μ − k·σ, μ + k·σ]
• The k is arbitrary
• Assumptions about the distribution:
• Mean and Variance defined
• Gaussian (Symmetrical)
• Stationary
• No outliers
• Simple math:
• Addition
• Regression
• TSA Forecasting
10. [μ − k·σ, μ + k·σ] → [x̄ − k·σ_x̄, x̄ + k·σ_x̄]
Remember: with enough random samples, their means will be Gaussian (Central Limit Theorem)
Data transformations:
• log
• exp
• Box-Cox
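The effect of the CLT and of the transformations above can be sketched in a few lines of Python. This is a toy illustration on a synthetic right-skewed "latency" sample (the exponential distribution and sample sizes are assumptions for the demo, not the presenters' data):

```python
import math
import random
import statistics

def skewness(xs):
    """Standardized third moment (population form)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

random.seed(42)
raw = [random.expovariate(1.0) for _ in range(20000)]   # right-skewed, like latency
logged = [math.log(x) for x in raw]                     # log transform tames the tail
# Central Limit Theorem: means of many random samples look Gaussian (symmetric)
means = [statistics.fmean(random.sample(raw, 50)) for _ in range(2000)]

print(round(skewness(raw), 2))     # ~2: strongly skewed
print(round(skewness(logged), 2))  # much smaller magnitude than raw
print(round(skewness(means), 2))   # close to 0: approximately symmetric
```

The same idea is what justifies applying Gaussian-style [x̄ ± k·σ_x̄] bounds to sample means even when the underlying data are not Gaussian.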
11. Busy Hour / Peak Minute
Losing information:
• Can't identify the day's outliers.
• Need 3+ wks. of daily peaks to measure p95.
For Performance Monitoring:
• Signal : Noise → 0.
• Top-Down forecast is hard to interpret.
• Misses underlying services.
• Hides trends.
For Capacity Planning, if used as an aggregated measure (Top-Down):
• Accurate representation of Multiplexing.
• Drilldown answers the "Who is hit the most?" question.
• We size for busy-hour traffic.
• Snapshot of service distribution:
• Great for the Top-Down approach.
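For concreteness, busy-hour aggregation can be sketched like this (the diurnal traffic profile below is invented for illustration). Note how 1,440 per-minute samples collapse into a single busy-hour number, which is exactly why the day's outliers disappear:

```python
import random
import statistics

random.seed(3)
# A day of per-minute traffic [Gbps]: flat at night, elevated 09:00-20:00 (hypothetical)
minutes = [50 + (30 if 540 <= m < 1200 else 0) + random.random() * 5
           for m in range(1440)]

hourly_means = [statistics.fmean(minutes[h * 60:(h + 1) * 60]) for h in range(24)]
busy_hour = max(range(24), key=lambda h: hourly_means[h])   # Busy Hour
peak_minute = max(minutes)                                  # Peak Minute

print(busy_hour, round(hourly_means[busy_hour], 1), round(peak_minute, 1))
```

The peak minute always exceeds the busy-hour mean; once only the daily peak is kept, three-plus weeks of such daily numbers are needed before a p95 over them is meaningful.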
12. Nonparametric: p95
• Sensitive to Aggregation Level:
• ∑ p95(xᵢ) ≠ p95(∑ xᵢ)
• Have to have 20+ latest data points (p95 ≡ 19/20)
• How many of these are outliers?
For Performance Monitoring / For Capacity Planning:
• Summation may lead to oversizing
• Ignores the bulk of the distribution
• We will miss SLA 5% of the time
• Forecasting p95: ergodicity assumption
• Distribution shape does not matter
• OK to have outliers
• Only 5% of data points will cause alerts
• Easy to understand
• "Tradition!"
• Math implemented in R, Python, Matlab, SAS
• even for regression
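The aggregation-sensitivity bullet (∑ p95(xᵢ) ≠ p95(∑ xᵢ)) is easy to demonstrate with two synthetic bursty flows whose peaks rarely coincide (toy numbers, not real traffic):

```python
import random
import statistics

def p95(xs):
    return statistics.quantiles(xs, n=100)[94]   # 95th percentile

random.seed(7)
# Each flow runs at 1 Gbps most of the time and bursts to 10 Gbps ~20% of the time
a = [random.choice([1, 1, 1, 1, 10]) for _ in range(10000)]
b = [random.choice([1, 1, 1, 1, 10]) for _ in range(10000)]
total = [x + y for x, y in zip(a, b)]

print(p95(a) + p95(b))   # 20: summing per-flow p95s
print(p95(total))        # ~11: bursts rarely coincide, so the aggregate p95 is lower
```

This is why the deck says to aggregate first and compute percentiles second: summing per-flow p95s ignores statistical multiplexing and oversizes.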
13. Should we Size Hardware for p95?
5% of the time, the SLO (99.9%? 99.999%?) will be violated
14. Shouldn't we Size Resources for Non-Outliers Instead?
We size for SLA, as long as traffic stays within outlier boundaries.
John Tukey's IQR method:
IQR = p75 − p25
LowerBound = p25 − D · IQR
UpperBound = p75 + D · IQR
• p95 will "outlaw" NON-outliers
• IFF p95 < UpperBound
• Need a "smart" D
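Tukey's fences translate directly to code. The sketch below uses the conventional D = 1.5 as a default and leaves the "smart" D as a parameter; the demand numbers are made up:

```python
import statistics

def tukey_bounds(xs, d=1.5):
    """Tukey fences: [p25 - d*IQR, p75 + d*IQR], with IQR = p75 - p25."""
    q = statistics.quantiles(xs, n=4)
    p25, p75 = q[0], q[2]
    iqr = p75 - p25
    return p25 - d * iqr, p75 + d * iqr

demand = [5, 6, 7, 8, 9, 10, 11, 12, 50]       # one obvious spike
lo, hi = tukey_bounds(demand)
non_outliers = [x for x in demand if lo <= x <= hi]
print((lo, hi), non_outliers)   # the spike at 50 falls outside the upper fence
```

Sizing to the upper fence `hi` rather than to `max(demand)` is the "size for non-outliers" idea developed on the next slides.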
15. Nonparametric: Outlier Boundaries
• Sensitive to Aggregation Level
• How many of these are real outliers?
For Performance Monitoring / For Capacity Planning:
• Summation may lead to oversizing
• Ergodicity assumption
• We size for non-outliers
• We guarantee SLA
• Math implemented (R, Python, Matlab, SAS)
• even for regression
• It looks at the bulk of the distribution.
• We will NOT miss SLA 5% of the time.
• Distribution shape does not matter
• Need fewer data points than for p95
• Only respond to outliers
16. A Word of Caution: Outlier Boundaries
• Skew = 0.13: p95 < UpperBound → Use UpperBound
• Skew = 0.26: p95 > UpperBound → Use p95? Split into HI and BULK?
17. • There is no "one size fits all" approach.
• There is no "one size fits all" statistic.
• There are common principles:
• Aggregate before computing percentiles.
• Use Outlier Boundaries:
• Performance & Capacity:
• Accounts for the bulk of the data;
• Distribution does not matter;
• Performance:
• Easy to ID outliers;
• Capacity:
• Sizing for Non-Outliers => less $
• Avoid Predefined Percentiles.
Coming Next:
• Metrics:
• Time (latency)
• Rate
• Count:
• Packets in flight; Packet Loss; % Utilization
• Models:
• Correlations
• Trends in Data
18. The Whirlpool of Metrics
User Metrics:
• throughput
• latency
• data loss
• data loss & latency
• latency & data loss
For Monitoring — Real metrics:
• # of Packets in Flight
• # of Packets in Queue or Lost
For Capacity Planning — Traditional Metrics in Planning:
• Throughput [Gbps]
• Latency_packet
• Latency_bits = blocked || loss delay
• Packets get queued and blocked.
• Bits may be bursty while packets are smooth.
• The reverse statement is true as well.
• Latency_packet = transit + queueing
• Packets need capacity.
• Packet sizes vary => capacity [Gbps]
Example:
• 320 Gbps = 26.7M · 1.5 KB · 8 bits / sec
• 320 Gbps = 10 · 4 GB · 8 bits / sec
Traditional Metrics for Monitoring:
• % Utilization
• Packet Loss Rate
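The two 320 Gbps examples are plain unit arithmetic, worth making explicit because "capacity in Gbps" hides an enormous range of packet rates:

```python
def pps_for(gbps, packet_bytes):
    """Packets per second needed to sustain a given bit rate at one packet size."""
    return gbps * 1e9 / (packet_bytes * 8)   # (bits/s) / (bits/packet)

print(pps_for(320, 1500))   # ~26.7M pps with 1.5 KB packets
print(pps_for(320, 4e9))    # only 10 pps if a "packet" were 4 GB
```

Same bit rate, packet rates seven orders of magnitude apart: this is why packets, not just bits, need capacity.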
19. Time, Rate, Count, and Utilization
Packet Queueing:
Pq = ErlangC(λ, C)
Packet Blocking:
Pb = ErlangB(λ, C)
(2012 paper)
N = pps · Latency (Little's Law)
Latency = ½ · RTT + T_queue
bps = pps · bits/packet
C = max(bps) · ½ · RTT
(2006 paper, CPU-centric)
• Utilization CAN BE useless
• if the metric does not reflect what it is used for:
• links were utilized near 100% (in bps terms) but saw no packet drops
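Little's Law and the bandwidth-delay product above can be sanity-checked numerically. The 50 ms RTT and 320 Gbps figures below are illustrative assumptions, not measurements:

```python
def packets_in_flight(pps, latency_s):
    """Little's Law: N = arrival rate x time in system."""
    return pps * latency_s

def capacity_bits(max_bps, rtt_s):
    """Bandwidth-delay product with one-way delay ~= RTT/2: C = max(bps) * RTT/2."""
    return max_bps * rtt_s / 2

rtt = 0.050                                   # assumed 50 ms round trip
n = packets_in_flight(26.7e6, rtt / 2)        # one-way latency ~= RTT/2 (queueing ignored)
c = capacity_bits(320e9, rtt)
print(round(n), c)   # ~667,500 packets in flight; ~8e9 bits of in-flight capacity
```

The point of the slide survives the arithmetic: a link can be "100% utilized" in bps while packet-level metrics (queueing, blocking) show no distress, so utilization alone is not a capacity signal.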
20. • There is no "one size fits all" approach.
• There is no "one size fits all" statistic.
• There are common principles:
• Aggregate before computing percentiles.
• Use a statistic that accounts for the bulk of the data.
• Metric = what's important to the BW user:
• Quality of Service (QoS):
• Network Latency
• Packets lost
Coming Next:
• Models:
• Trend in Data
• Correlation
• Time-Series Analysis
21. Trend in Data
Performance Monitoring:
• Is this "normal" behavior?
• Will this trend continue?
• High values will be marked as outliers
• Are they?
22. Dealing with Trends
Performance Monitoring:
Option 1:
• Fit a linear regression
• If it's a good fit:
• Get the distribution of residuals
• Add p95 or the outlier boundary of residuals to the regression line
23. This results in:
Option 1: Linear Regression → Residuals allows us to detrend the data and deal with a stationary proxy…
… IFF:
• Residuals are stationary
• Residuals are normal
• Residuals are homoscedastic
Performance Monitoring
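Option 1 can be sketched end-to-end in a few lines: pure-Python OLS on synthetic data with one injected anomaly. A real pipeline would also test the IFF conditions above before trusting the residuals:

```python
import statistics

def ols_line(ts, ys):
    """Ordinary least squares fit y = a + b*t."""
    tm, ym = statistics.fmean(ts), statistics.fmean(ys)
    b = (sum((t - tm) * (y - ym) for t, y in zip(ts, ys))
         / sum((t - tm) ** 2 for t in ts))
    return ym - b * tm, b

ts = list(range(100))
ys = [10 + 0.5 * t + 2 * ((-1) ** t) for t in ts]   # linear trend + bounded noise
ys[70] += 30                                        # injected anomaly

a, b = ols_line(ts, ys)
resid = [y - (a + b * t) for t, y in zip(ts, ys)]   # detrended, ~stationary proxy
threshold = statistics.quantiles(resid, n=100)[94]  # p95 of residuals
alerts = [t for t, r in zip(ts, resid) if r > threshold]
print(alerts)   # the anomaly at t=70 is among the flagged points
```

The p95-of-residuals threshold flags roughly 5% of points by construction; the outlier-boundary version of the same idea flags only genuine anomalies.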
24. If Residuals are Not Normal / Not Homoscedastic?
Performance Monitoring:
Linear Regression does not work.
25. Plan B: Directly Predict %-iles
Performance Monitoring:
Option 2: Quantile Regression: rq(Demand ~ Time)
• Requires stationary trends
• No need for homoscedasticity
• No need for normality
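A toy stand-in for R's `quantreg::rq(Demand ~ Time)` can be built in pure Python: for any fixed slope, the pinball-loss-minimizing intercept is the empirical q-quantile of the residuals, so a 1-D slope search suffices. This is a sketch for intuition, not a production fitter (real ones solve a linear program):

```python
def pinball_loss(ys, fits, q):
    """Quantile ("pinball") loss: asymmetric absolute error."""
    return sum(max(q * (y - f), (q - 1) * (y - f)) for y, f in zip(ys, fits))

def quantile_line(ts, ys, q=0.95):
    """Quantile regression y = a + b*t via a grid search over the slope b.
    For a fixed b, the pinball-optimal intercept is the q-quantile of y - b*t."""
    best = None
    for i in range(-200, 201):                       # slope grid: -2.00 .. 2.00
        b = i / 100
        r = sorted(y - b * t for t, y in zip(ts, ys))
        a = r[min(len(r) - 1, int(q * len(r)))]      # empirical q-quantile
        loss = pinball_loss(ys, [a + b * t for t in ts], q)
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best[1], best[2]

ts = list(range(60))
ys = [3 + 0.5 * t + (t % 7) for t in ts]             # trend + skewed "noise"
a, b = quantile_line(ts, ys, q=0.95)
covered = sum(y <= a + b * t for t, y in zip(ts, ys)) / len(ts)
print(round(b, 2), round(covered, 2))                # slope near 0.5, ~95%+ under the line
```

Note that nothing here assumed normality or constant variance, which is exactly the appeal of Option 2.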
27. Using Quantile Regression — Capacity Planning
Great for most use cases:
• No need for homoscedasticity
• No need for normality
• Works for correlated Time Series
• Requires stationary trends
• Old and New have the same weight
29. Using Quantile Regression — Performance Monitoring
Compare Model Prediction Ranges for models built on the Baseline and New data sets; quantify the change.
(Data shown here are generated exclusively for this hypothetical example.)
32. Using Time Series Analysis
ETS (Error, Trend, Seasonality) Decomposition
For Performance Monitoring / For Capacity Planning:
1. Fit a Forecasting (ETS) Model
2. Get residuals
3. Identify & Interpret Outliers in Residuals
4. Interpolate (or Predict) Outliers
5. Re-fit the Forecasting Model
6. Predict Using the Fitted Model
33. TSA Forecasting
Autoregressive (ARIMA) || ETS (EWMA) Forecasting
1. Fit a Forecasting Model
2. Get residuals
3. Identify Outliers
4. Interpolate (or Predict) Outliers
5. Re-fit the Forecasting Model
6. Predict Using the Fitted Model
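The six steps map onto code naturally. The sketch below uses simple exponential smoothing (the EWMA flavor of ETS) with Tukey fences on the residuals; a real workflow would use R's `forecast` package or statsmodels, and the series is invented:

```python
import statistics

def ewma_fit(ys, alpha=0.3):
    """One-step-ahead simple exponential smoothing (EWMA) fits."""
    level, fits = ys[0], [ys[0]]
    for y in ys[:-1]:
        level = alpha * y + (1 - alpha) * level
        fits.append(level)
    return fits

def next_forecast(ys, alpha=0.3):
    fits = ewma_fit(ys, alpha)
    return alpha * ys[-1] + (1 - alpha) * fits[-1]

def robust_forecast(ys, alpha=0.3, d=1.5):
    """Steps 1-6: fit, flag residual outliers (Tukey fences), replace them
    with their fitted values, re-fit, and forecast the next point."""
    fits = ewma_fit(ys, alpha)
    resid = [y - f for y, f in zip(ys, fits)]
    q = statistics.quantiles(resid, n=4)
    lo = q[0] - d * (q[2] - q[0])
    hi = q[2] + d * (q[2] - q[0])
    cleaned = [y if lo <= r <= hi else f for y, f, r in zip(ys, fits, resid)]
    return next_forecast(cleaned, alpha)

series = [100, 102, 101, 103, 500, 104, 105, 103, 106, 104]   # one spike
print(round(next_forecast(series)), round(robust_forecast(series)))  # 124 vs 104
```

Without the outlier step the spike drags the forecast far above the ~104 baseline; interpolating the outlier and re-fitting (steps 4-5) brings it back.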
34. Issues / Problems / Challenges
How can we account for these variabilities?
• Underlying services have their own plans:
• Growth
• Deprecation
• Relocation
• Supporting infrastructure has its own lifecycle:
• New Product Introduction
• Implementation and Growth
• Depreciation
• Tech Refresh
• Topologies and policies change in time:
• Changes in policies and topology can lead to changes in demand
35. Possible Solutions: Flow Level
1. Bottom-Up:
• Forecast each service individually;
• Follow up with Monte-Carlo aggregation
(Svc1 + Svc2 + Svc3 → Flow1)
Advantages:
• Each service's trend and variability is accounted for.
• Each service's growth plans are easy to account for.
Possible Problem:
• Different prediction intervals are not indicative of different data variability.
36. Possible Solutions: Flow Level
1. Bottom-Up:
• Forecast each service individually;
• Follow up with Monte-Carlo aggregation
(Svc1 + Svc2 + Svc3)
Possible Problem:
• Different prediction intervals are not indicative of different data variability.
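Bottom-Up Monte-Carlo aggregation can be sketched as follows. The per-service forecast distributions here are hypothetical Gaussians (mean, stdev in Gbps); the point is that sampling and then taking the percentile of the sum beats summing per-service percentiles:

```python
import random
import statistics

random.seed(1)
# Hypothetical per-service demand forecasts: (mean, stdev) in Gbps
forecasts = {"svc1": (40, 5), "svc2": (25, 10), "svc3": (10, 2)}

def monte_carlo_p95(fcsts, trials=20000):
    """Sample every service independently, sum the draws, take p95 of the totals."""
    totals = [sum(random.gauss(m, s) for m, s in fcsts.values())
              for _ in range(trials)]
    return statistics.quantiles(totals, n=100)[94]

naive = sum(m + 1.645 * s for m, s in forecasts.values())   # sum of per-service p95s
mc = monte_carlo_p95(forecasts)
print(round(naive, 1), round(mc, 1))   # the Monte-Carlo p95 is smaller: statmux gain
```

Correlated services would need joint (not independent) sampling, which is one reason the prediction intervals coming out of this step are not a direct read on data variability.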
37. Possible Solutions: Flow Level
2. Top-Down:
• Forecast the flow.
• Get the distribution of each component's weight in the flow.
• Compute each component's demand forecast.
(Svc1 + Svc2 + Svc3)
Possible Problems:
• Component Weights can drift in time.
• Interaction and Contention => "unknown unknown"
Solutions:
• Estimate Component Weights.
• Account for Quantile Compression.
38. Flow-Level TSA Forecasting
Autoregressive (ARIMA) || ETS (EWMA) Forecasting
For Capacity Planning:
1. Fit a Forecasting Model
2. Get residuals
3. Identify Outliers
4. Interpolate (or Predict) Outliers
5. Re-fit the Forecasting Model
6. Predict Using the Fitted Model
• TSA is NOT the Whole Story:
• Business Growth is not accounted for
39. Flow-Level Top-Down Stochastic Problem
Problem:
1. Flow composition varies from day to day.
2. Flow composition also varies within a day.
3. Old components may not be relevant anymore.
4. New components may not have enough history.
41. Stochastic Problem Solution
For each Flow:
• Identify Services active in this Flow
• For each Service:
• For each Hour:
• Compute Stats: lower_bound, min, p05, p10, p25, p50, Mean, StDev, p75, p90, p95, p99, max, upper_bound
• Forecast demand
42. Top-Down Forecasting Stochastic Problem Solution
For each Flow:
• Identify Services active in this Flow
• For each Service:
• For each Hour:
• Compute Stats: lower_bound, min, p05, p10, p25, p50, Mean, StDev, p75, p90, p95, p99, max, upper_bound
• Compute this Svc's weight for each Stat
• For each Stat (long-term means):
• Infer unconstrained weights (use long-term skew)
• Forecast demand
43. Solution to the Top-Down Forecasting Stochastic Problem
For each Flow:
• Identify Services active in this Flow
• For each Service:
• For each Hour:
• Compute Stats: lower_bound, min, p05, p10, p25, p50, Mean, StDev, p75, p90, p95, p99, max, upper_bound
• Compute this Svc's weight for each Stat
• For each Stat (long-term means):
• Infer unconstrained weights (use long-term skew)
• Forecast demand: Fcst_hour × weight_svc
44. This Solves Most of the Problems
• Underlying services have their own plans:
• Growth
• Deprecation
• Relocation
• USE PER-SERVICE DEPENDENCIES
• Supporting infrastructure has its own lifecycle:
• New Product Introduction
• Implementation and Growth
• Depreciation
• Tech Refresh
• USE PER-SERVICE / PER-FLOW DEPENDENCIES
• Topologies and policies change in time:
• Changes in policies and topology can lead to changes in demand
• USE DUMMY VARIABLES
Now we can account for these variabilities!
Usefulness depends on:
• Aggregation of Data
• StatMuxing?
• Peak Hour?
• Hourly Stats?
45. • Forecast Demand based on the Model
• Bottom-Up
• Traffic → pps drives DC Load, which drives Space & Power
• Account for QoS in Demand Forecasting
• Plan for SLO
• DO NOT Assume Anything!
• Especially about Shapes of Distributions.
• Mean and Variance are Overrated!
• So is p95!
• Use Outlier Boundaries ("fences")
• Size Systems for "would-be-unbounded" forecasts
• DO Use the Entire Distribution to be Proactive
47. All data in this presentation are generated solely for illustration purposes
Select images and formulae are provided with permission from Facebook
49. Capacity Planning
Is the number of Gbps on a constrained system indicative of demand?
Is it right to forecast the upper bound of traffic on a constrained system?
Resource Constraint => Quantile Compression => Underforecasting the load => Undersizing the resource
Account for Quantile Compression:
• Use p25, p50, and p75 to compute the Skew
• Forecast p25 and p50
• Use the Skew to infer the forecast of p75 (unconstrained)
• Compute the forecast of UpperBound
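The four bullets can be sketched with Bowley's quartile skewness, which uses exactly p25, p50, and p75. All numbers below are invented for illustration; `skew_free` would come from a period before the constraint, and D = 1.5 is the conventional Tukey multiplier:

```python
def bowley_skew(p25, p50, p75):
    """Quartile (Bowley) skewness: robust, uses only p25/p50/p75."""
    return (p75 - 2 * p50 + p25) / (p75 - p25)

def p75_from_skew(p25, p50, skew):
    """Invert Bowley skewness for p75, given forecasts of p25 and p50
    and the long-term (unconstrained) skew."""
    return ((1 + skew) * p25 - 2 * p50) / (skew - 1)

# Long-term skew measured before the link became constrained (hypothetical):
skew_free = bowley_skew(30, 40, 58)            # right-skewed demand
# Forecasts of the quartiles that are NOT squeezed by the constraint:
p25_f, p50_f = 36.0, 48.0
p75_unconstrained = p75_from_skew(p25_f, p50_f, skew_free)
iqr = p75_unconstrained - p25_f
upper_bound = p75_unconstrained + 1.5 * iqr    # Tukey fence with D = 1.5
print(round(p75_unconstrained, 1), round(upper_bound, 1))
```

Because the constraint clips the upper quantiles first, forecasting p75 directly would underforecast; reconstructing it from the lower quartiles and the unconstrained skew recovers the "would-be-unbounded" upper bound to size against.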