
When sizing network capacity, several factors, such as traffic, Quality of Service (QoS), and Total Cost of Ownership (TCO), are usually taken into account. Generally, it boils down to jointly minimizing cost and maximizing traffic, subject to protocol and QoS constraints. The stochastic nature of network traffic and the queueing effects of link saturation add uncertainty to an already complex optimization problem. In this paper, we examine the sources of traffic demand variability and explore Monte-Carlo methodology as an efficient way of solving these problems.


- 1. Sources of Traffic Demand Variability and Use of Monte Carlo for Network Capacity Planning Performance and Capacity 2014 by CMG November 05, 2014 Alex Gilgur & Brian Eck Views and opinions expressed in this presentation are views and opinions of its authors. If found to be in contradiction with views and policies of Google, Inc., the latter take precedence. Select images are reproduced with permission from Google, Inc.
- 2. Moore’s Law in Reverse: Drinking from a firehose? http://www.kpcb.com/internet-trends $
- 3. …………... “Matter and energy had ended and with it, space and time... “All collected data had come to a final end. Nothing was left to be collected. “But all collected data had yet to be completely correlated and put together in all possible relationships. “A timeless interval was spent in doing that. “And it came to pass that AC learned how to reverse the direction of entropy. “But there was now no man to whom AC might give the answer of the last question.” Isaac Asimov. “The Last Question”. 1956 What does it cost to own a network? “... ‘THERE IS AS YET INSUFFICIENT DATA FOR A MEANINGFUL ANSWER.’”
- 4. What does it cost to own a network? We don’t have the time for all this! Guesstimate!
- 5. What does it cost to own a network? Ahah! But how sure are you? It depends on: ● number of servers ● topology ● policies ● traffic patterns ● network protocols
- 6. What does a network cost? What is the confidence interval of your “guesstimate” of Total Cost of Ownership of a network? Network Cost Demand Topology Policies Construction Node & Link Reliability The Fishbone Diagram Hardware & Software
- 7. Sizing the Network Network Cost Demand Topology Policies Construction Node & Link Reliability Hardware & Software Network SIZE Network Cost Network size is where we bring value
- 8. Network SIZE Topology Demand Node & Link Reliability Demand Fishbone
- 9. Demand Fishbone Demand Usage QoS Topology Destination Source Guarantees Latency Flow
- 10. Demand Variability ● Noise & Gaps in data ● Non-stationarity & Outliers ● Variation by O & D Nodes o Node A o Node Z ● Variation by QoS o latency o Pr{delivery} ● Variation within QoS o other factors ● Distribution: Bursty Wide Amplitude Complex Patterns Congestion Control
- 11. Demand Forecastability: Noise & Gaps ● Noise & Gaps in data ● Non-stationarity & Outliers ● Variation by O & D Nodes o Node A o Node Z ● Variation by QoS o latency o Pr{delivery} ● Variation within QoS o other factors ● Distribution: o “from feast to famine” o Bursts o Congestion Control
- 12. Demand Forecastability: Non-Stationarity ● Noise & Gaps in data ● Non-stationarity & Outliers ● Variation by O & D Nodes o Node A o Node Z ● Variation by QoS o latency o Pr{delivery} ● Variation within QoS o other factors ● Distribution: Bursty Wide Amplitude Complex Patterns Congestion Control
- 13. Demand Variability: Non-stationarity ● Noise & Gaps in data ● Non-stationarity & Outliers ● Variation by O & D Nodes o Node A o Node Z ● Variation by QoS o latency o Pr{delivery} ● Variation within QoS o other factors ● Distribution: Bursty Wide Amplitude Complex Patterns Congestion Control
- 14. Demand Variability: QoS Variation SC1 SC2 ● Noise & Gaps in data ● Non-stationarity & Outliers ● Variation by O & D Nodes o Node A o Node Z ● Variation by QoS o latency o Pr{delivery} ● Variation within QoS o other factors ● Distribution: Bursty Wide Amplitude Complex Patterns Congestion Control
- 15. Demand Variability: Other Factors ● Noise & Gaps in data ● Non-stationarity & Outliers ● Variation by O & D Nodes o Node A o Node Z ● Variation by QoS o latency o Pr{delivery} ● Variation within QoS o other factors ● Distribution: Bursty Wide Amplitude Complex Patterns Congestion Control
- 16. Demand Variability: Signal Distribution ● Noise & Gaps in data ● Non-stationarity & Outliers ● Variation by O & D Nodes o Node A o Node Z ● Variation by QoS o latency o Pr{delivery} ● Variation within QoS o other factors ● Distribution Bursty Wide Amplitude Complex Patterns Congestion Control
- 17. Demand Predictability ● Not all forecasting tools were created equal: ○ Non-Gaussian distributions ○ Non-stationarity ○ Congestion Control “All models are wrong. Some models are useful” - G.E.P. Box ● Time-series analysis (TSA) is not the only way to forecast Demand: ○ Explanatory variables: ■ Timestamp is one of them ■ Power ■ CPU ■ Business Metrics Forecast
- 18. From Demand to Capacity Demand QoS Topology Capacity
- 19. QoS = what’s important to user 1. QoS = 1 / Latency 2. QoS = “Goodput” = Throughput * Pr{delivery} 1. Low Latency 2. High Probability of: a. Delivery b. Accuracy
- 20. Find shortest path from Node 1 to Node 2 Routing for Low Latency: SPF: “Travelling Salesman” 4 = Node 4 2 = “Latency of this link = 2 units” Cost = Latency QoS = 1/Cost = 1/Latency
- 21. Find shortest path from Node 1 to Node 2 IF Node 4 is down Cost = Latency QoS = 1/Cost = 1/Latency Find shortest path from Node 1 to Node 2 4 = Node 4 2 = “Latency of this link = 2 units” Routing for Low Latency: SPF: “Travelling Salesman”
- 22. Find shortest path from Node 1 to Node 2 IF Node 4 is down ... … and Link 3-5 is losing packets Cost = Latency QoS = 1/Cost = 1/Latency Find shortest path from Node 1 to Node 2 4 = Node 4 2 = “Latency of this link = 2 units” Routing for Low Latency: SPF: “Travelling Salesman”
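
The three routing scenarios on slides 20 to 22 (healthy network, Node 4 down, Node 4 down with Link 3-5 losing packets) can be sketched with Dijkstra's shortest-path algorithm, using latency as the link cost. The slide's diagram does not survive in this transcript, so the topology and latencies in `LINKS` are hypothetical, as are the function and parameter names. Treating a lossy link as unusable is also a simplifying assumption.

```python
import heapq

def shortest_path(links, src, dst, down_nodes=(), bad_links=()):
    """Dijkstra on an undirected graph; links maps (a, b) -> latency."""
    adj = {}
    for (a, b), latency in links.items():
        if a in down_nodes or b in down_nodes:
            continue  # skip links that touch a failed node
        if (a, b) in bad_links or (b, a) in bad_links:
            continue  # assumption: a lossy link is simply excluded
        adj.setdefault(a, []).append((b, latency))
        adj.setdefault(b, []).append((a, latency))

    best = {src: 0}
    heap = [(0, src, [src])]  # (total latency, node, path so far)
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, latency in adj.get(node, []):
            new_cost = cost + latency
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return float("inf"), []  # destination unreachable

# Hypothetical topology: (node, node) -> latency units
LINKS = {(1, 4): 2, (4, 2): 2, (1, 3): 3, (3, 5): 1, (5, 2): 2, (3, 2): 6}

print(shortest_path(LINKS, 1, 2))                                       # (4, [1, 4, 2])
print(shortest_path(LINKS, 1, 2, down_nodes={4}))                       # (6, [1, 3, 5, 2])
print(shortest_path(LINKS, 1, 2, down_nodes={4}, bad_links={(3, 5)}))   # (9, [1, 3, 2])
```

Each failure pushes the flow onto a longer path, so QoS = 1/Latency degrades step by step while the demand itself is unchanged.
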
- 23. QoS = what’s important to user 1. QoS = 1 / Latency 2. QoS = “Goodput” = Throughput * Pr{delivery} 1. Low Latency 2. High Probability of: a. Delivery b. Accuracy
- 24. “Travelling Salesman” Non-linear optimization Routing for “Goodput”: Nonlinear optimization
- 25. “Travelling Salesman” Non-linear optimization Routing for “Goodput”: Nonlinear optimization
- 26. Non-linear optimization Routing for “Goodput”: Can it be simplified? Assume: ● No Queueing ○ No Blocking Redefine: Can be pseudo-linearized
- 27. Routing As a Process SPF
- 28. SPF Routing As a Process Draining
- 29. SPF Routing As a Process
- 30. SPF Routing As a Process Draining
- 31. SPF Routing As a Process
- 32. SPF Routing As a Process Draining
- 33. SPF Routing As a Process
- 34. SPF Routing As a Process Draining
- 35. SPF Routing As a Process
- 36. “Whack-a-Mole!” Routing is updated all the time via: ● Protocol (e.g., TCP) ● SDN Control We need to accommodate each Flow’s: ● Primary Paths ● Alternative Paths
- 37. Network Demand & Throughput Link Throughput Demand Topology Node & Link Reliability Link Size
- 38. Demandi Throughputj Connex Traversal Time (Latency) Concurrencyj Capacity From Demand to Capacity:
- 39. Demandi Throughputj Link Traversal Time (Latency) Concurrencyj Erl-1 (N, PB) Capacity QoS PB To account for Queueing & StatMux, …
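
Slide 39's Erl-1 (N, PB) reads like an inverse Erlang lookup: given the offered load (the slide's Concurrency) and a target blocking probability PB, find the smallest number of channels N. The sketch below assumes the Erlang B (loss) model, which the slides do not name explicitly. The function names and the sample load are illustrative.

```python
def erlang_b(offered_load, channels):
    """Blocking probability for an offered load (in erlangs) on `channels` servers."""
    b = 1.0  # with zero channels, every arrival is blocked
    for n in range(1, channels + 1):
        # Standard Erlang B recursion: B(n) = A*B(n-1) / (n + A*B(n-1))
        b = offered_load * b / (n + offered_load * b)
    return b

def min_channels(offered_load, max_blocking):
    """Smallest channel count whose Erlang B blocking is <= max_blocking."""
    if max_blocking <= 0:
        raise ValueError("max_blocking must be positive")
    n, b = 0, 1.0
    while b > max_blocking:
        n += 1
        b = offered_load * b / (n + offered_load * b)
    return n

# Hypothetical example: concurrency of 10 erlangs, blocking target PB = 1%
concurrency = 10.0
n = min_channels(concurrency, 0.01)
print(n, erlang_b(concurrency, n))
```

The offered load here plays the role of concurrency, that is throughput multiplied by traversal time, as in the slide 38 chain from Demand to Capacity.
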
- 40. Demand Throughput Concurrency for Flowi Connex Traversal Time (Latency) Capacity For Long-Haul Networks, it reduces to… LPropagation >> LQueueing Erl-1 (N, PB) QoS PB
- 41. Demand Throughput Capacity Bandwidth Fill Factor For Long-Haul Networks, it reduces to… Can’t forget the stochastic element LPropagation >> LQueueing Latency ~ const Concurrency = const * Throughput
- 42. We can forecast demand Demand: ● A1 -> Z1 : X11 Gbps ● A1 -> Z2 : X12 Gbps ● A2 -> Z3 : X23 Gbps Throughput on each Link Capacity for each Link
- 43. We can forecast demand Demand: ● A1 -> Z1 : X11 Gbps ● A1 -> Z2 : X12 Gbps ● A2 -> Z3 : X23 Gbps Throughput on each Link Capacity for each Link Throughput is combinatorial
- 44. Demand is NOT Deterministic Demand: ● A1 -> Z1 : X11 Gbps ● A1 -> Z2 : X12 Gbps ● A2 -> Z3 : X23 Gbps Throughput on each Link Neither is Throughput
- 45. From Deterministic Demand to Throughput. Demand: N1_N4: 100 Gbps; N2_N4: 200 Gbps. Link throughputs to find: L12 = ? L24 = ? L43 = ? L31 = ? L141 = ? Result: L12 = 100 G, L21 = 200 G, L24 = 300 G, L14 = 300 G, L41 = 0, L43 = 0, L31 = 0. [Diagram: four-node network N1 to N4 with labelled links, link weights, and per-link flow labels of 100 G and 200 G.]
- 46. From Gaussian Demand to Throughput. Demand: N1_N4: N (100, 10) Gbps; N2_N4: N (200, 15) Gbps. Link throughputs to find: L12 = ? L24 = ? L43 = ? L31 = ? L141 = ? Result: L12 = N (100, 10) G, L21 = N (200, 15) G, L24 = N (300, 18) G, L14 = N (300, 18) G, L41 = 0, L43 = 0, L31 = 0. [Diagram: four-node network N1 to N4 with labelled links and link weights.]
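
The N (300, 18) figure on slide 46 is consistent with the sum rule for Gaussian variables, reading the second parameter as a standard deviation and assuming the two demands are independent (the slide does not state this). Here $X_{14}$ and $X_{24}$ are shorthand for the N1_N4 and N2_N4 demands:

```latex
X_{14} \sim \mathcal{N}(100,\ 10^2), \qquad X_{24} \sim \mathcal{N}(200,\ 15^2)
\;\Longrightarrow\;
X_{14} + X_{24} \sim \mathcal{N}\!\left(100 + 200,\ 10^2 + 15^2\right),
\qquad
\sigma = \sqrt{100 + 225} = \sqrt{325} \approx 18.03
```

Means add and variances add, so the standard deviation of the combined flow (about 18) is smaller than the 25 a naive sum of the two standard deviations would give.
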
- 47. From Generic Random Demand to Throughput. Demand: N1_N4: G (100, ...) Gbps; N2_N4: G (200, ...) Gbps. Link throughputs to find: L12 = ? L24 = ? L43 = ? L31 = ? L141 = ? Result: ? [Diagram: four-node network N1 to N4 with labelled links and link weights.]
- 48. Monte-Carlo
- 49. Monte-Carlo
- 50. Monte-Carlo
- 51. Every Demand VALUE is a REALIZATION of a RANGE of possible values Demand Forecast Replace point estimates with probability distributions
- 52. Link Throughput: Monte-Carlo Forecasting Replace point estimates with probability distributions Slice the timeline For each timestamp: For each Flow: roll the dice N times For each timestamp: For each of the N dice rolls: Throughput = sum (Flows)
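
The procedure on slide 52 can be sketched as follows: for each timestamp, roll the dice N times per flow, then sum the flows roll by roll to get N throughput scenarios for the link. This is a minimal sketch, not the authors' implementation. The sampler interface, the function names, and the nearest-rank percentile are assumptions, and the two Gaussian flows borrow the parameters from slide 46 purely for illustration.

```python
import math
import random
import statistics

def link_throughput_scenarios(flow_samplers, timestamps, n_rolls=1000, seed=42):
    """Return {timestamp: [n_rolls simulated link throughputs]}."""
    rng = random.Random(seed)
    scenarios = {}
    for ts in timestamps:
        # For each flow: roll the dice n_rolls times
        rolls = [[sampler(rng, ts) for _ in range(n_rolls)]
                 for sampler in flow_samplers]
        # For each of the rolls: throughput = sum(flows)
        scenarios[ts] = [sum(per_flow) for per_flow in zip(*rolls)]
    return scenarios

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical flows sharing one link; any distribution can be plugged in
flows = [
    lambda rng, ts: rng.gauss(100, 10),   # N1_N4 demand, Gbps
    lambda rng, ts: rng.gauss(200, 15),   # N2_N4 demand, Gbps
]
result = link_throughput_scenarios(flows, ["t0", "t1"], n_rolls=5000)
for ts, values in result.items():
    print(ts, round(statistics.mean(values), 1), round(percentile(values, 95), 1))
```

Because the samplers are arbitrary callables, the same loop works for non-Gaussian, bursty, or empirical demand distributions where no closed-form sum exists.
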
- 53. Monte Carlo works with any Transfer Function Monte Carlo Throughput on each Link Demand (A-Z) Capacity for each Link
- 54. Use Case (a case study) ● Hundreds of links ● Thousands of demand flows forecasted o 95th percentile o Unspecified Prediction Intervals ● Establish optimal Inventory Size & Policies o Account for Demand Predictability ● Estimate demand variability effect on: o Network Size o TCO Forecast
- 55. Approach Quantify Demand Distributions (use Biases) Use Monte-Carlo to forecast Throughput Distributions Use Monte-Carlo to compute Capacity Predictive Intervals Use Monte-Carlo to optimize Inventory Size & Policies Biases = Forecast - Observed Biases != Residuals
- 56. Quantify Demand Ranges & Prepare MC “Forecasts” Start For Each Time Slice For Each Flow Compute: Bias = Projected - Observed Build: Bias Distribution Roll the dice N = 100 times Apply the rolled-out numbers to the baseline forecast for each flow Save the N Demand scenarios
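
A minimal sketch of the slide 56 loop for a single flow and time slice: compute biases from past forecasts, treat them as an empirical distribution, roll the dice N = 100 times, and apply each draw to the baseline forecast. The slide does not say how a rolled bias is applied. Since Bias = Projected - Observed, this sketch assumes an additive correction (scenario = baseline - bias). All names and numbers are hypothetical.

```python
import random

def bias_samples(projected, observed):
    """Bias = Projected - Observed, one value per past forecast."""
    return [p - o for p, o in zip(projected, observed)]

def demand_scenarios(baseline, biases, n=100, seed=1):
    """Resample the empirical bias distribution n times and apply it to the baseline."""
    rng = random.Random(seed)
    # Assumption: a forecast that ran high by `b` is corrected by subtracting `b`
    return [baseline - rng.choice(biases) for _ in range(n)]

# Hypothetical history for one flow, in Gbps
projected = [100, 110, 120, 130]
observed = [95, 112, 110, 131]

biases = bias_samples(projected, observed)
scenarios = demand_scenarios(baseline=200, biases=biases, n=100)
print(biases)
print(min(scenarios), max(scenarios))
```

Resampling observed biases avoids assuming any particular distribution shape, at the cost of never producing a scenario more extreme than the history already contains.
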
- 57. Run the Pseudo-Random Demands through MC Map1 Map2 MapN MapN-1 Reduce F flows * N forecasts Map: Compute Capacities (N) Reduce: Analyze the N Capacity Forecasts L links: Capacity Prediction Intervals Capacity Forecasts for each Link
- 58. What does it cost to own a network?
- 59. ● Range forecasting is cool! ● Network Demand varies in many ways ● For WAN, it is OK to use throughput o still it’s better to use concurrency ● Demand ≠ Throughput o Demand -> Throughput -> Capacity ● Monte-Carlo is a model o Therefore it is wrong o But it is useful In Conclusion
- 60. Acknowledgements ● Google’s NetOps Division ● Google’s NetCap & ODS Teams ● Josep Ferrandiz ● Mike Perka ● Leonid Kats ● C. Steven Gunn ● Matthew Mathis ● Kevin J. Mitchell ● Linda Eck ● Sophia Shtilman ● Leora Gilgur
- 61. agilgur@google.com brianeck@google.com THANK YOU!!!
- 62. Backup Slides
- 63. Biases != Residuals. Why? How good are forecasts at predicting demand N days from “now” ???
- 64. H/W Availability: Fault Trees. Reliability function, with failure treated as a memoryless (Poisson) process and R = 1 - F: F(C|t) = F((1 OR 2)|t) = 1 - R(1|t) * R(2|t); F(D|t) = F((3 AND 4 AND 5)|t) = F(3|t) * F(4|t) * F(5|t); F(E|t) = F((7 AND 8)|t) = F(7|t) * F(8|t); F(F|t) = F((6 OR E)|t) = 1 - (1 - F(7|t) * F(8|t)) * R(6|t); F(B|t) = F((C OR D OR F)|t) = 1 - R(1|t) * R(2|t) * (1 - F(3|t) * F(4|t) * F(5|t)) * (1 - F(7|t) * F(8|t)) * R(6|t) ⇒ R(A|t) = R(1|t) * R(2|t) * (1 - F(3|t) * F(4|t) * F(5|t)) * (1 - F(7|t) * F(8|t)) * R(6|t). [Diagram: fault tree with gates B, C, D, E, F.] There’s got to be a cleaner way!
- 65. Fault Trees and Monte-Carlo. [Diagram: fault tree with gates B, C, D, E, F.] Each Component holds: state = (run, fail); rule = (AND, OR, NONE); mtbf; mttr; next_update_time; elements: Component; and the methods fail(), run(), update(time). Pseudocode:

      clock.start()
      for each component:
          component.update(time = clock)
      clock.set(min(next_update_time))

      run():
          if rule == NONE: state = run
          else: // apply rule to elements
          return

      fail():
          if rule == NONE: state = fail
          else: // apply rule to elements
          return

      update(time):
          if time ≥ next_update_time:
              if state == fail:
                  run(); next_update_time += Exp(mtbf)
              else:
                  fail(); next_update_time += Exp(mttr)
          return

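
A runnable sketch of the idea in slide 65's pseudocode: leaf components alternate between run and fail with exponentially distributed times (mean mtbf while running, mean mttr while failed), and an event-driven clock jumps to the next state change. This is one possible reading, not the authors' code. In particular, gates here derive their state from their children on demand instead of carrying their own run()/fail() methods, and the `unavailability` helper and all parameter values are assumptions.

```python
import random

class Leaf:
    """Basic component that alternates between running and failed."""
    def __init__(self, mtbf, mttr, rng):
        self.mtbf, self.mttr, self.rng = mtbf, mttr, rng
        self.failed = False
        self.next_update_time = rng.expovariate(1.0 / mtbf)  # first failure

    def update(self, time):
        if time >= self.next_update_time:
            if self.failed:   # repair done: run until the next failure
                self.failed = False
                self.next_update_time += self.rng.expovariate(1.0 / self.mtbf)
            else:             # failure: stay down until repaired
                self.failed = True
                self.next_update_time += self.rng.expovariate(1.0 / self.mttr)

class Gate:
    """OR gate fails if any element fails; AND gate fails only if all fail."""
    def __init__(self, rule, elements):
        self.rule, self.elements = rule, elements

    @property
    def failed(self):
        states = [e.failed for e in self.elements]
        return all(states) if self.rule == "AND" else any(states)

def unavailability(top, leaves, horizon):
    """Fraction of simulated time during which the top event is failed."""
    clock, downtime = 0.0, 0.0
    while clock < horizon:
        next_time = min(min(l.next_update_time for l in leaves), horizon)
        if top.failed:        # state is constant between events
            downtime += next_time - clock
        clock = next_time
        for leaf in leaves:
            leaf.update(clock)
    return downtime / horizon

rng = random.Random(0)
a, b, c = (Leaf(mtbf=9.0, mttr=1.0, rng=rng) for _ in range(3))
tree = Gate("OR", [a, Gate("AND", [b, c])])  # hypothetical tree
print(unavailability(tree, [a, b, c], horizon=50_000.0))
```

For a single leaf the estimate should approach mttr / (mtbf + mttr), which gives a quick sanity check before trusting the simulation on a larger tree.
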
- 66. Probability distributions. Simplest - Uniform: least relevant to anything real; a convenient building block for any distribution. Most standard - Gaussian: mathematically the simplest; does not describe the IT world. Most relevant - Poisson & Exponential: relatively simple mathematically; accurately describe times between arrivals and service times for a memoryless process. Notation: F(x) = Pr (X ≤ x) is the CDF; f(x) = F’(x) is the PDF.
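
Slide 66 calls the Uniform distribution a building block for any distribution. One standard way to use it as such is inverse-transform sampling: draw U from Uniform[0, 1) and push it through the inverse CDF of the target distribution. The sketch below does this for the Exponential distribution, where F(x) = 1 - exp(-x / mean). The function name and the sample mean are illustrative.

```python
import math
import random

def exponential_from_uniform(mean, rng):
    """Draw one Exponential(mean) sample by inverting its CDF."""
    u = rng.random()                   # U ~ Uniform[0, 1)
    return -mean * math.log(1.0 - u)   # solves F(x) = u for x

rng = random.Random(0)
samples = [exponential_from_uniform(5.0, rng) for _ in range(10_000)]
print(sum(samples) / len(samples))     # should land close to the mean of 5.0
```

Using 1 - u keeps the argument of the logarithm strictly positive, since `random()` can return 0.0 but never 1.0.
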
